3-D Convolutional Network with Adversarial Training for Video Generation


1. Abstract

After building an image translator powered by CycleGAN, I found that, in the generation and translation of videos, the traditional model could not handle the temporal dimension, causing the generated videos to collapse. In this work, drawing on 3D convolutional networks for spatio-temporal data representation, I conduct experiments aiming to find appropriate and powerful neural network structures for generating coherent video of a dynamic object. So far, I have found that a three-dimensional deep residual network performs best at extracting, filling and transforming features that carry temporal information. Based on residual connections, I designed the ResNet-V5-A/B/C networks with adversarial training, which are capable of generating realistic video clips even under unbalanced translation.

2. Introduction

2.1 Inspiration

Recently, I have read many tutorials and papers on video problems in computer vision and machine learning. Well-known works such as 3D convolution, 3D ResNet and the two-stream network inspired me a lot. Meanwhile, I wondered whether the GANs from my previous work could translate videos. One day, I found a clue in CycleGAN’s introduction: a generated animation of horse-to-zebra translation. The author split a horse recording into frames and translated them one by one with a trained CycleGAN. But something was strange in the result: the stripe patterns on the zebra kept shifting between frames (on a real zebra they would stay fixed), which meant the network could not capture the relation between different frames and instead focused only on spatial information. In other words, the network could not handle the sequential information behind the video. That is when I started to explore computer vision problems along the temporal dimension.

Video: http://35.238.106.135/wp-content/uploads/2018/01/hourse2zebra.mp4
2D-CycleGAN: Horse-to-zebra (https://arxiv.org/pdf/1703.10593.pdf)

2.2 Three-Dimensional Convolution

In one class of methods, the network uses full 3D convolutions to cover two spatial dimensions and one temporal dimension. With the addition of this extra axis, the convolutional kernel can directly extract relations between spatial features and temporal information. I will use this simple but strong structure as the backbone in this section. I plan to test the two-stream network and other structures in the future; compared to the 3D-convolution method, they cannot preserve the full interaction between spatial and temporal features because the two kinds of information are processed in separate streams. So I will mainly focus on 3D convolutional networks to build a productive generative model. Due to dataset limitations, the experiments focus only on dynamic object generation, but I will test dynamic scenes with background masks in the future.
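To make this concrete, here is a minimal sketch, assuming PyTorch (the post does not name a framework), of a single 3D convolution whose kernel spans the temporal axis as well as the two spatial axes, so each output value mixes information from neighbouring frames:

```python
import torch
import torch.nn as nn

# Input layout: (batch, channels, frames, height, width)
clip = torch.randn(1, 3, 9, 64, 64)          # 9 RGB frames of a 64x64 clip

conv3d = nn.Conv3d(in_channels=3, out_channels=16,
                   kernel_size=(3, 3, 3),     # 3 frames x 3 px x 3 px
                   stride=1, padding=1)

features = conv3d(clip)                       # -> (1, 16, 9, 64, 64)
print(features.shape)
```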

3D-GAN: http://carlvondrick.com/tinyvideo/

3. Basic Model

In this part, I only train a translation network, mainly to test the network's capability of gathering, mapping and reshaping features that carry temporal information. The results helped me design the generator of a complete generative adversarial model. In these tests, I use the small dataset to adjust the network's structure and the full dataset to evaluate performance.

3.1 Methodology

I used the Sketch Simplify 2D convolutional translator as a reference and adjusted its structure and parameters to fit the 3D setting. In the following experiments, I tried different structures and training methods to find the best model. For the sake of training speed, I initially used only small-scale datasets and limited the maximum number of filters to 32.

Sketch Simplify: http://hi.cs.waseda.ac.jp/~esimo/en/research/sketch/

The diagram above shows the structure of a fully-convolutional, bottleneck-like network for sketch simplification. In this section, I test several similar structures: residual blocks in the middle, an encoder and decoder without flat convolutions, and a 'fully-residual' network. All networks have 16 channels in the input layer for 64-by-64 frames and likewise 16 channels in the output. The number of channels doubles at each layer in the down-sampling stage, halves at each layer in the up-sampling stage, and stays constant in the middle stage.

3.2 Loss Function

To compare generated video frames with target frames, the networks in this section were all trained with a gradient descent optimizer using MSE loss. Compared to L1 loss, the squared loss lets the network suppress more unrelated data, which leaves less noise in the result and also trains faster. In Trial A-1, I trained two networks to test both functions. The result showed that frames generated with MSE loss have fewer stray points in the first 100 epochs of training, and the loss decreases faster.

Mean Squared Error
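For reference, the two pixel-wise losses compared above, where $\hat{y}_i$ is a generated value, $y_i$ the corresponding target, and $N$ the number of values in a clip:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2, \qquad \mathcal{L}_{L1} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|$$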

3.3 Bottleneck Translator

Applying the basic method described in 3.1, I designed a model with three stages: down-sampling, residual blocks as flat convolution, and up-sampling. These trials examine how well a bottleneck-like model handles this job. In practice, I tested structures consisting of several convolutional layers as the down-sampling stage, transposed-convolutional layers as the up-sampling stage, and a final convolutional layer that produces the output frames (RGB filters).
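Below is a minimal sketch of this three-stage layout, assuming PyTorch; the ResBlock3d definition, kernel sizes and activations are my own illustration rather than the exact configuration used in the trials, though the channel and feature-map sizes match Trial A-1 (256 middle filters, 9x4x4 middle features):

```python
import torch
import torch.nn as nn

class ResBlock3d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.InstanceNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.InstanceNorm3d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))   # residual connection

class BottleneckTranslator3d(nn.Module):
    """Down-sampling convs -> flat residual blocks -> transposed-conv up-sampling."""
    def __init__(self, base=16, n_down=4, n_blocks=6):
        super().__init__()
        layers = [nn.Conv3d(3, base, 3, padding=1), nn.ReLU(inplace=True)]
        ch = base
        for _ in range(n_down):                # channels double at every level
            layers += [nn.Conv3d(ch, ch * 2, 3, stride=(1, 2, 2), padding=1),
                       nn.ReLU(inplace=True)]
            ch *= 2
        layers += [ResBlock3d(ch) for _ in range(n_blocks)]
        for _ in range(n_down):                # channels halve on the way back up
            layers += [nn.ConvTranspose3d(ch, ch // 2, (1, 4, 4),
                                          stride=(1, 2, 2), padding=(0, 1, 1)),
                       nn.ReLU(inplace=True)]
            ch //= 2
        layers += [nn.Conv3d(ch, 3, 3, padding=1), nn.Tanh()]  # RGB output frames
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# 9 frames of 64x64 input -> 9 frames of 64x64 output
out = BottleneckTranslator3d()(torch.randn(1, 3, 9, 64, 64))
print(out.shape)   # (1, 3, 9, 64, 64)
```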

Trials | Network Type | Downsample | Upsample | Flat Convolution | Middle Filters | Middle Features | Results
A-1 | Residual Bottleneck | 4 Convolutional Layers | 4 Transposed-Conv Layers | 6 ResNet Blocks | 256 | 9x4x4 | Dynamic features with intact colors
A-2 | Residual Bottleneck | 4 Convolutional Layers | 4 Transposed-Conv Layers | 9 ResNet Blocks | 256 | 9x4x4 | Chaotic features with chaotic colors
A-3 | Residual Bottleneck | 3 Convolutional Layers | 3 Transposed-Conv Layers | 6 ResNet Blocks | 128 | 9x8x8 | Model collapses (overfit)
A-4 | Residual Bottleneck | 4 Convolutional Layers | 4 Transposed-Conv Layers | 3 ResNet Blocks | 256 | 9x4x4 | Dynamic features with poor colors

The results show that, in all configurations, the detailed features in the generated video were mostly dynamic and chaotic. The colors of the character are vague, making the output very easy for a human to distinguish from real video. For some complex actions, the network confused the positions of arms and legs, showing that it could not use temporal information well enough to make the generated video smooth.

3.4 U-Net Translator

Inspired by U-Net for image segmentation, I tried this kind of structure as a video frame translator. I tested two trials of this type: a 4-level U-Net with 3 shortcuts, as the picture below shows, and a 5-level U-Net. Both networks were trained on the Stick-to-Miku dataset with MSE loss for comparison between video clips. U-Net: https://arxiv.org/pdf/1505.04597.pdf

U-Net Structure
Trials | Network Type | Depth | Middle Filters | Middle Features | Results
A-5 | U-Net | 4 | 128 | 9x8x8 | Broken features with poor colors
A-6 | U-Net | 5 | 256 | 9x4x4 | Broken features with intact colors

The results of these two trials are worse than Trial A-1 with the Residual Bottleneck. Although the frames generated by the U-Net are accurate in some cases, they lose too much temporal information, so many features come out damaged. I suspected an overfitting problem, and I will explore the U-Net further in adversarial training.
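For comparison with the bottleneck sketch above, here is a toy two-level 3D U-Net, assuming PyTorch, that shows only the skip-concatenation idea; the real Trials A-5/A-6 used 4 and 5 levels:

```python
import torch
import torch.nn as nn

def conv3d_block(cin, cout):
    return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet3d(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = conv3d_block(3, 16)
        self.enc2 = conv3d_block(16, 32)
        self.pool = nn.MaxPool3d((1, 2, 2))              # halve H and W, keep all frames
        self.mid  = conv3d_block(32, 32)
        self.up   = nn.ConvTranspose3d(32, 16, (1, 2, 2), stride=(1, 2, 2))
        self.dec  = conv3d_block(16 + 16, 16)            # encoder features concatenated
        self.out  = nn.Conv3d(16, 3, 3, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                                # full-resolution features
        e2 = self.enc2(self.pool(e1))                    # half-resolution features
        d  = self.up(self.mid(e2))                       # back to full resolution
        d  = self.dec(torch.cat([d, e1], dim=1))         # shortcut from the encoder
        return self.out(d)

print(TinyUNet3d()(torch.randn(1, 3, 9, 64, 64)).shape)  # (1, 3, 9, 64, 64)
```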

3.5 Deep ResNet Translator V1, V2

To evaluate the influence of residual blocks, I ran several trials with more residual blocks and fewer up-sampling or down-sampling layers. In this case the network can go deeper, benefiting from the ability of residual connections to keep gradients stable. I tested several trials on the Stick-to-Miku dataset to simulate conditional generation using the movement data provided by the stickman animation.

Trials | Network Type | Downsample | Upsample | Flat Convolution | Middle Filters | Middle Features | Results
A-7 | Deep ResNet | 1 Conv Layer | 1 Transposed-Conv Layer | 14 ResNet Blocks | 32 | 9x32x32 | Vague features
A-8 | Deep ResNet | 1 Conv Layer | 1 Transposed-Conv Layer | 18 ResNet Blocks | 32 | 9x32x32 | Damaged features with vague colors
A-9 | Deep ResNet | 2 Conv Layers | 2 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x16x16 | Smooth but vague
A-10 (128x128) | Deep ResNet | 2 Conv Layers | 2 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x32x32 | Smooth but vague

After training A-7 to A-9, I found many vague features in the networks' outputs. I guessed this was caused by insufficient training data, so I doubled the frame size of the video frames and retrained A-9 to get a higher-resolution result. However, the result was poor as well. I then replaced some layers with max-pooling to simplify the down-sampling stage.
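The two down-sampling orderings tested in the V2 trials look roughly like this (a sketch assuming PyTorch; filter counts are illustrative):

```python
import torch.nn as nn

# A-11 style: convolution first, then spatial max-pooling
down_a = nn.Sequential(
    nn.Conv3d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),        # pool H and W, keep every frame
)

# A-12 style: max-pooling first, then convolution
down_b = nn.Sequential(
    nn.MaxPool3d(kernel_size=(1, 2, 2)),
    nn.Conv3d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
)
```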

Trials | Network Type | Downsample | Upsample | Flat Convolution | Middle Filters | Middle Features | Results
A-11 | Deep ResNet V2 | Conv Layer + MaxPool3D | 2 Transposed-Conv Layers | 14 ResNet Blocks | 32 | 9x16x16 | Smooth but vague
A-12 | Deep ResNet V2 | MaxPool3D + Conv Layer | 2 Transposed-Conv Layers | 14 ResNet Blocks | 32 | 9x16x16 | Smooth but unstable
A-13 (128x128) | Deep ResNet V2 | MaxPool3D + 2 Conv Layers | 3 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x8x8 | Incomplete translation
A-14 (128x128) | Deep ResNet V2 | 3x (MaxPool3D + Flat-Conv) | 3 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x8x8 | Smooth features with poor colors

Due to limited computing resources, I only trained Trials A-13 and A-14 for 30 epochs. Training on 128×128 frames is much slower than on 64×64 frames, so I have not been able to evaluate these models yet. The 128×128 experiments will be continued in the GAN tests.

3.6 Deep ResNet Translator V3, V4, V5-A/B/C

To make down-sampling more efficient, I combined the down-sampling layers and the flat convolutional layers. Referring to the design of ResNet-29, I designed expansion blocks to replace the old down-sampling layers. In the V3 design, I also explored convolution along the temporal dimension: in some trials the temporal stride is 2, which compresses the temporal feature maps.
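A generic sketch of such a strided expansion block, assuming PyTorch; this only illustrates the idea of down-sampling while widening the channels and is not the actual ResNet-V3 block design. Setting t_stride=2 additionally compresses the temporal axis, as in Trial A-17:

```python
import torch.nn as nn

class ExpansionBlock3d(nn.Module):
    def __init__(self, cin, cout, t_stride=1):
        super().__init__()
        stride = (t_stride, 2, 2)                 # halve H and W, optionally T
        self.body = nn.Sequential(
            nn.Conv3d(cin, cout, 3, stride=stride, padding=1),
            nn.InstanceNorm3d(cout),
            nn.ReLU(inplace=True),
            nn.Conv3d(cout, cout, 3, padding=1),
            nn.InstanceNorm3d(cout),
        )
        # 1x1x1 projection so the shortcut matches the new shape
        self.skip = nn.Conv3d(cin, cout, 1, stride=stride)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))
```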

Trials | Network Type | Downsample | Upsample | Normalization | Temporal Compress | Results
A-15 | Deep ResNet V3 | MaxPool + (3+4+6+2) ResNet Blocks | 5 Transposed-Conv Layers | Instance Normalization | No | Dynamic features with poor colors
A-16 | Deep ResNet V3 | MaxPool + (3+4+6+2) ResNet Blocks | 5 Transposed-Conv Layers | Large batch size | No | Damaged video
A-17 | Deep ResNet V3 | MaxPool + (3+4+6+2) ResNet Blocks | 5 Transposed-Conv Layers | Instance Normalization | 1/4 in Middle | Incomplete translation
A-18 | Deep ResNet V3 | (2+3+4+6+2) ResNet Blocks | 4 Transposed-Conv Layers | Instance Normalization | No | Dynamic features with poor colors

In these results, the dynamic-feature phenomenon is very similar to Trials A-1 to A-4. I suppose that, with more up-sampling layers, feature generation gets out of control because gradients are harder to pass between layers, and the relation between figures and their locations from the down-sampling stage is lost. That is also why U-Net's shortcuts between down-sampling and up-sampling layers work so well for translating related features. To test this idea, I designed another type of residual translator, named ResNet-V4 and ResNet-V5-A/B. With residual connections and U-Net-style shortcuts through the whole network, it appears capable of mapping every level of features between the input and output videos. To demonstrate the performance of this architecture, I conducted several trials with ResNet-V4 and ResNet-V5.

Trials | Network Type | Downsample | Upsample | Depth | Temporal Compress | Results
A-19 | Deep ResNet V4 | (1+2+3+4+2) ResNet Blocks | (2+4+3+2+1) ResNet Blocks | 50 Layers | No | Dynamic features
A-20 | Deep ResNet V4 | (1+2+2+2+1) ResNet Blocks | (1+2+2+2+1) ResNet Blocks | 34 Layers | No | Slightly dynamic features

The ResNet-V5-A/B/C designs and the other experiments in this section will be released after I publish the formal paper or apply for copyright protection.

In ResNet-V5-B, the number of convolutional layers can reach 89 for 128×128 video translation using my newly designed translation structure, so in real applications it can handle more difficult translations and more complex situations.

3.7 Squeeze and Excitation Pipeline

Referring to the Squeeze-and-Excitation structure in https://arxiv.org/pdf/1709.01507.pdf, I tried adding channel-wise gates to every 3D ResNet block to increase the network's capacity. The diagram below shows the Squeeze-and-Excitation pipeline inside a residual block.

Squeeze and Excitation Pipeline
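A minimal sketch of the gate, following the cited paper but adapted here to a 3D residual block (assuming PyTorch; the reduction ratio and block layout are illustrative):

```python
import torch
import torch.nn as nn

class SEResBlock3d(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.InstanceNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.InstanceNorm3d(channels),
        )
        # Squeeze: global average over (T, H, W); Excitation: channel-wise gate
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.body(x)
        return torch.relu(x + y * self.gate(y))   # re-weight channels, then add the skip
```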

4. Conditional Adversarial Training

4.1 Methodology

In this section, I use the stickman input as the network's condition and several of the translation models above as generators in adversarial training, aiming for higher-quality generated video clips. In previous experiments I tested many generative and translation models for video clips; here I run many tests on the design of the discriminator to maximize the performance of the generative network.

Conditional GAN (https://arxiv.org/pdf/1411.1784.pdf)

In the original design, the generator has two types of input: noise (z) and a condition (y). But in my experiments, which are designed to translate or generate video from an instructional video clip, the noise input appeared to be of little use, since its purpose is to map features and diversify the generator's output. So in this case the noise input can be dropped, because I aim to generate only a single character.

Illustration of Minimax Game in C-GAN
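For reference, the conditional GAN objective from the cited paper, where $y$ is the condition (here the stickman clip) and $z$ is the noise input that I eventually drop:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x \mid y)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z \mid y))\big)\big]$$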

4.2 Discriminator and Training

Following the conditional GAN, the discriminator generally contains a concatenation step that combines the object being judged with the conditions to form the discriminator's input. After concatenation, as in a normal GAN, I use several convolutional layers ending with a sigmoid function, and I train the network with a binary cross-entropy loss, using zeros to represent fake video and ones to represent real video.
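A minimal sketch of such a conditional discriminator and one training step, assuming PyTorch; the layer counts, strides and clip shapes are illustrative rather than the exact configurations of Trials B-1 to B-5:

```python
import torch
import torch.nn as nn

class CondDiscriminator3d(nn.Module):
    def __init__(self):
        super().__init__()
        # Video and condition are concatenated channel-wise: 3 + 3 = 6 channels
        self.net = nn.Sequential(
            nn.Conv3d(6, 32, 3, stride=(1, 2, 2), padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(32, 64, 3, stride=(1, 2, 2), padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(64, 128, 3, stride=(2, 2, 2), padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(128, 1, 3, stride=(2, 2, 2), padding=1),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Sigmoid(),
        )

    def forward(self, video, condition):
        return self.net(torch.cat([video, condition], dim=1))

bce = nn.BCELoss()
D = CondDiscriminator3d()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real, stick = torch.randn(2, 3, 8, 64, 64), torch.randn(2, 3, 8, 64, 64)
fake = torch.randn(2, 3, 8, 64, 64)            # would come from the generator

# Ones label real clips, zeros label generated clips
loss_d = bce(D(real, stick), torch.ones(2, 1)) + \
         bce(D(fake.detach(), stick), torch.zeros(2, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```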

Binary Cross Entropy
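For reference, the standard binary cross-entropy used here, where $t_i \in \{0, 1\}$ is the label (1 for real, 0 for fake) and $D_i$ is the discriminator's sigmoid output for sample $i$:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\big[\,t_i \log D_i + (1 - t_i)\log(1 - D_i)\,\big]$$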
Trials | Discriminator | Concatenation | Temporal Compress | Generator | Normalization | Balance | Result
B-1 | Concatenate + 4 Conv Layers | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-34 (Trial A-9) | Instance Normalization | Discriminator is stronger | Smooth but slightly dynamic
B-2 | Concatenate + 5 Conv Layers | Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-34 (Trial A-9) | Instance Normalization | Pending | Pending
B-3 | Concatenate + 4 Conv Layers | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-V4-50 (Trial A-19) | Batch Normalization | Discriminator is stronger | Damaged features
B-4 | Concatenate + 4 Conv Layers | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-V5-A-40 | Instance Normalization | Discriminator is slightly stronger | Smooth but slightly dynamic
B-5 | Concatenate + ResNet-11 | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Blocks | ResNet-V5-B-71 | Instance Normalization | Discriminator is stronger | Smooth but slightly dynamic

4.3 Different Generators

In this section, I conducted several trials using the translator models from the previous tests and evaluated their performance in adversarial training.

Trials | Generator | Generator Type | Discriminator | Results
B-6 | Trial A-1 | Bottleneck-22 | 5 Conv Layers with Direct Concatenation | Dynamic features
B-7 | Trial A-6 | U-Net-22 | 5 Conv Layers with Direct Concatenation | Pending
B-8 | Trial A-9 | ResNet-V1-32 | 5 Conv Layers with Direct Concatenation | Smooth but with slightly-dynamic features

4.4 Deep Networks and Mixed Training

The ResNet-V5-A/B/C designs and the other experiments in this section will be released after I publish the formal paper or apply for copyright protection.

Trials | Generator | Depth | Discriminator | Training | Result
B-9 | ResNet-V5-A | 40 Layers | 6 Conv Layers | C-GAN Loss | Slightly damaged features
B-10 | ResNet-V5-B | 71 Layers | 6 Conv Layers | C-GAN Loss | -
B-11 (128x128) | ResNet-V5-B | 89 Layers | ResNet-11 | C-GAN Loss | -
B-12 | ResNet-V5-C1 | 45 Layers | 6 Conv Layers | C-GAN Loss | -
B-13 | ResNet-V5-C2 | 54 Layers | ResNet-11 | Mixed Loss | -
B-14 | ResNet-V5-C3 | 60 Layers | 6 Conv Layers | Mixed Loss | -
B-15 | ResNet-V5-C4 | 68 Layers | 6 Conv Layers | Mixed Loss | -

5. Environment

All networks were trained on a laptop with a GTX 960M or on Google Compute Engine with Tesla K80 and Tesla P100 GPUs, depending on training time. All code and sampling jobs were run on macOS 10.13 and Ubuntu 16.04 LTS.

5.1 Miku Datasets

I prepared three datasets for dynamic object translation: a small Miku-to-Miku dataset, a small Stick-to-Miku dataset, and a full Stick-to-Miku dataset. The first two datasets contain approximately 3,000 image pairs and the full dataset contains 10,000 image pairs. The Miku-to-Miku datasets contain the anime character dancing in two dressing styles and are used to test the model's translation performance. The Stick-to-Miku datasets contain the same dances performed by a stickman and by Miku separately, simulating conditional generation of a dynamic object from movement data. All video frames were rendered with the MikuMikuDance software, and the character models' authors are listed in the references.

Video: http://35.238.106.135/wp-content/uploads/2018/01/sample.mp4
Data Sample Preview: Stickman (Left), Electric Miku (Middle), TDA Miku (Right)

5.2 Horse-to-Zebra Dataset

5.3 Video Experiment Platform

To conduct these experiments and implement the different models, I built a video experiment platform that can split videos, generate datasets, train different networks, and render an output video side-by-side with the original data. I designed a frame-traverse slider that can generate arbitrary sequences of frames as the network's input. I will release it for research use after I publish my formal paper.
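A hypothetical, minimal version of such a frame-traverse slider (the real platform is not released, so the class name and parameters below are my own illustration, assuming PyTorch): it yields every sliding window of clip_len frames, so arbitrary frame sequences can be fed to the network:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class FrameSliderDataset(Dataset):
    def __init__(self, frames, clip_len=9, stride=1):
        # frames: tensor of shape (num_frames, channels, height, width)
        self.frames, self.clip_len, self.stride = frames, clip_len, stride

    def __len__(self):
        return (self.frames.shape[0] - self.clip_len) // self.stride + 1

    def __getitem__(self, idx):
        start = idx * self.stride
        window = self.frames[start:start + self.clip_len]   # (T, C, H, W)
        return window.permute(1, 0, 2, 3)                    # (C, T, H, W)

video = torch.randn(300, 3, 64, 64)                          # a decoded clip
loader = DataLoader(FrameSliderDataset(video), batch_size=8, shuffle=True)
for clip in loader:        # larger batches help keep the GPU busy
    print(clip.shape)      # (8, 3, 9, 64, 64)
    break
```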

During the experiments, because Python object code runs slowly, I optimized the division of work between GPU and CPU by reducing algorithmic complexity and increasing the batch size, so that the network trainer keeps the GPU occupied with minimal stalls caused by slow CPU-side "for" loops. In my tests, the old version of the frame slider reached 96% GPU utilization, while the new slider can keep the GPU cores fully busy. On average, training speed on the laptop with a GTX 960M increased by about 30%.

6. Limitations

Although this new architecture works better than the traditional U-Net, the generated videos are still very easy for a human to distinguish from real ones, mainly because of vague textures. I believe that, as the frame size and the network's scale increase, this architecture can be applied to more scenarios.

 

7. Results

Architecture | Train | Discriminator | Loss | Score (lower is better)
5-level U-Net (baseline) | Direct | - | 1.1488 | 0
4-level U-Net | Direct | - | 1.1999 | +0.0511
Conv-Deconv-1 | Direct | - | 1.1739 | +0.0251
Conv-Deconv-2 | Direct | - | 1.1419 | -0.0069
Conv-Deconv-3 | Direct | - | 1.1425 | -0.0063
Conv-Deconv-4 | Direct | - | 1.1410 | -0.0078
Conv-Deconv-5 | C-GAN | Conv | 1.1436 | -0.0052
Residual U-Net | C-GAN | Conv | 1.1432 | -0.0056
ResNet-V5-C | C-GAN | Conv | 1.1352 | -0.0136

8. Conclusions

To sum up, with the use of 3D convolution, the generative video model performs well on this conditional video generation task, and the residual learning framework significantly enhances performance compared with the traditional U-Net.
