1. Abstract
After building an image translator powered by CycleGAN, I found that, when generating and translating videos, the traditional model could not handle the temporal dimension, causing the generated videos to collapse. In this work, drawing on 3D convolutional networks for spatio-temporal representation, I conduct experiments to find an appropriate and powerful network structure for generating coherent video of a dynamic object. So far, I have found that a three-dimensional deep residual network performs best at extracting, filling, and transforming features that carry temporal information. Based on residual connections, I designed the ResNet-V5-A/B/C networks with adversarial training, which can generate realistic video clips even in unbalanced translation settings.
2. Introduction
2.1 Inspiration
Recently I have read many tutorials and papers on video problems in computer vision and machine learning. Well-known works such as 3D convolution, 3D ResNet, and the two-stream network inspired me a lot. Meanwhile, I wondered whether the GANs from my previous work could translate videos. One day, I found a clue in the CycleGAN introduction: a generated animation of horse-to-zebra translation. The author split every frame of a horse recording and translated the frames one by one with a trained CycleGAN. But something was strange in the result: many flickering patterns appeared on the zebra (they should stay fixed in a real video), which meant the network could not capture the relation between different frames and only focused on spatial information. In other words, the network could not handle the sequential information behind the video. That is when I started to explore computer vision problems along the temporal dimension.
2D-CycleGAN: Horse-to-zebra (https://arxiv.org/pdf/1703.10593.pdf)
2.2 Three-Dimensional Convolution
In one family of methods, the network uses full 3D convolutions to cover the two spatial dimensions and one temporal dimension. With the additional axis, the convolutional kernel can directly extract the relation between spatial features and temporal information. I use this simple but strong structure as the backbone in this section. I will test the two-stream network and other structures in the future, but compared with the 3D-convolution method, a two-stream design cannot keep the full interaction between spatial and temporal features when it handles the dimensions separately. So I mainly focus on 3D convolutional networks to build productive generative models. Due to the limits of the dataset, the experiments only cover dynamic object generation; I will test dynamic scenes with background masks in the future.
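As a minimal illustration (not the exact layers used in my models), a single PyTorch Conv3d over a clip shaped (batch, channels, frames, height, width) already mixes neighboring frames:

```python
import torch
import torch.nn as nn

# A clip of 9 RGB frames at 64x64: (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 9, 64, 64)

# A 3x3x3 kernel spans 3 frames and a 3x3 spatial window at once,
# so each output voxel mixes spatial and temporal neighbors.
conv3d = nn.Conv3d(in_channels=3, out_channels=16,
                   kernel_size=3, stride=1, padding=1)

features = conv3d(clip)
print(features.shape)  # torch.Size([1, 16, 9, 64, 64])
```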
3D-GAN: http://carlvondrick.com/tinyvideo/
3. Basic Model
In this part, I only train a translation network, mainly to test the network's ability to gather, map, and reshape features with temporal information. The results helped me design the generator of a complete generative adversarial model. In these tests, I use the small dataset to adjust the network structure and the full dataset to evaluate performance.
3.1 Methodology
I used the Sketch Simplify 2D-convolutional translator as a reference and adjusted its structure and parameters to fit the 3D setting. In the following experiments, I tried different structures and different training methods to find the best model. For the sake of training speed, I initially used only the small-scale datasets and limited the maximum number of filters to 32.
Sketch Simplify: http://hi.cs.waseda.ac.jp/~esimo/en/research/sketch/
The diagram above shows the structure of a fully convolutional, bottleneck-like network for sketch simplification. In this section, I test several similar structures: residual blocks in the middle, an encoder-decoder without flat convolution, and a "fully-residual" network. All networks use 16 channels in the input layer for 64-by-64 frames and, symmetrically, 16 channels in the output layer. The number of channels doubles at each layer of the down-sampling stage, halves at each layer of the up-sampling stage, and stays constant in the middle stage.
3.2 Loss Function
To compare the generated video frames with the target frames, all networks in this section were trained with a gradient descent optimizer using MSE loss. Compared with L1 loss, the squared loss lets the network filter out more unrelated data, which produces less noise in the result and trains faster. In Trial A-1, I trained two networks to test both loss functions. The result showed that frames generated with MSE loss contained fewer stray points during the first 100 epochs of training, and the loss descended faster.
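As a minimal sketch (the tensor shapes are placeholders, not my exact training settings), the two losses can be compared directly in PyTorch:

```python
import torch
import torch.nn as nn

mse_loss = nn.MSELoss()  # mean of (x - y)^2: penalizes large deviations more strongly
l1_loss = nn.L1Loss()    # mean of |x - y|

# Fake "generated" and "target" clips: (batch, channels, frames, height, width).
generated = torch.randn(2, 3, 9, 64, 64, requires_grad=True)
target = torch.randn(2, 3, 9, 64, 64)

print(mse_loss(generated, target).item())
print(l1_loss(generated, target).item())
```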
Mean Squared Error
3.3 Bottleneck Translator
Applying the basic method described in 3.1, I designed a model with three stages: down-sampling, residual blocks as flat convolution, and up-sampling. These trials show how well a bottleneck-like model handles this task. Concretely, I tested structures containing several convolutional layers as the down-sampling stage, transposed-convolution layers as the up-sampling stage, and finally a convolutional layer that generates the output frames (RGB filters).
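The following is a minimal sketch of such a three-stage translator. The channel counts, strides, and normalization choices are placeholders for illustration; this is not the exact configuration of Trials A-1 to A-4.

```python
import torch
import torch.nn as nn

class ResBlock3d(nn.Module):
    """A plain 3D residual block: two 3x3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.InstanceNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.InstanceNorm3d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class BottleneckTranslator(nn.Module):
    """Down-sampling stage -> flat residual blocks -> up-sampling stage -> RGB output."""
    def __init__(self, base=16, n_blocks=6):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv3d(3, base, 3, stride=(1, 2, 2), padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(base, base * 2, 3, stride=(1, 2, 2), padding=1), nn.ReLU(inplace=True),
        )
        self.flat = nn.Sequential(*[ResBlock3d(base * 2) for _ in range(n_blocks)])
        self.up = nn.Sequential(
            nn.ConvTranspose3d(base * 2, base, (3, 4, 4), stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(base, base, (3, 4, 4), stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_rgb = nn.Conv3d(base, 3, 3, padding=1)

    def forward(self, x):
        return torch.tanh(self.to_rgb(self.up(self.flat(self.down(x)))))

# BottleneckTranslator()(torch.randn(1, 3, 9, 64, 64)).shape -> torch.Size([1, 3, 9, 64, 64])
```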
Trials | Network Type | Downsample | Upsample | Flat Convolution | Middle Filters | Middle Features | Results |
---|---|---|---|---|---|---|---|
A-1 | Residual Bottleneck | 4 Convolutional Layers | 4 Transposed-Conv Layers | 6 ResNet Blocks | 256 | 9x4x4 | Dynamic features with intact colors |
A-2 | Residual Bottleneck | 4 Convolutional Layers | 4 Transposed-Conv Layers | 9 ResNet Blocks | 256 | 9x4x4 | Chaotic features with chaotic colors |
A-3 | Residual Bottleneck | 3 Convolutional Layers | 3 Transposed-Conv Layers | 6 ResNet Blocks | 128 | 9x8x8 | Model collapses (overfit) |
A-4 | Residual Bottleneck | 4 Convolutional Layers | 4 Transposed-Conv Layers | 3 ResNet Blocks | 256 | 9x4x4 | Dynamic features with poor colors |
The results show that, in all configurations, the detailed features in the generated video were mostly flickering and chaotic. The colors of the character are vague, making the output very easy for a human to distinguish from real video. In some complex actions, the network confused the positions of arms and legs, showing that it could not use temporal information well enough to make the generated video smooth.
3.4 U-Net Translator
Inspired by U-Net for image segmentation, I tried this kind of structure as a video frame translator. I tested two trials: a 4-level U-Net with 3 shortcuts, as the picture below shows, and a 5-level U-Net. Both networks were trained on the Stick-to-Miku dataset with MSE loss for comparison between video splits.
U-Net: https://arxiv.org/pdf/1505.04597.pdf
U-Net Structure
Trials | Network Type | Depth | Middle Filters | Middle Features | Results |
---|---|---|---|---|---|
A-5 | U-Net | 4 | 128 | 9x8x8 | Broken features with poor colors |
A-6 | U-Net | 5 | 256 | 9x4x4 | Broken features with intact colors |
The results of these two trials are worse than Trial A-1 with the residual bottleneck. Although the frames generated by the U-Net are accurate in some cases, they lose too much temporal information, so many features end up damaged. I guessed this was caused by overfitting, and I will explore the U-Net further in adversarial training.
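As a rough illustration of the U-Net shortcut idea in 3D, here is a toy two-level network rather than the exact 4/5-level structures used in Trials A-5 and A-6; the channel counts and strides are placeholders:

```python
import torch
import torch.nn as nn

class TinyUNet3d(nn.Module):
    """Two-level 3D U-Net: the encoder feature map is concatenated
    with the decoder feature map through a shortcut."""
    def __init__(self, base=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv3d(3, base, 3, padding=1), nn.ReLU(inplace=True))
        self.down = nn.Sequential(
            nn.Conv3d(base, base * 2, 3, stride=(1, 2, 2), padding=1), nn.ReLU(inplace=True))
        self.up = nn.Sequential(
            nn.ConvTranspose3d(base * 2, base, (3, 4, 4), stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True))
        # The decoder sees its own features plus the encoder shortcut (2 * base channels).
        self.dec1 = nn.Sequential(nn.Conv3d(base * 2, base, 3, padding=1), nn.ReLU(inplace=True))
        self.to_rgb = nn.Conv3d(base, 3, 3, padding=1)

    def forward(self, x):
        skip = self.enc1(x)                    # full-resolution features
        bottom = self.down(skip)               # spatially down-sampled
        up = self.up(bottom)                   # back to full resolution
        merged = torch.cat([up, skip], dim=1)  # U-Net shortcut: channel-wise concat
        return torch.tanh(self.to_rgb(self.dec1(merged)))

# TinyUNet3d()(torch.randn(1, 3, 9, 64, 64)).shape -> torch.Size([1, 3, 9, 64, 64])
```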
3.5 Deep ResNet Translator V1, V2
To evaluate the influence of residual blocks, I ran several trials with more residual blocks and fewer up-sampling or down-sampling layers. In this case the network can go deeper, benefiting from the ability of residual connections to keep gradients stable. I tested several trials on the Stick-to-Miku dataset to simulate conditional generation using the movement data provided by the stickman animation.
Trials | Network Type | Downsample | Upsample | Flat Convolution | Middle Filters | Middle Features | Results |
---|---|---|---|---|---|---|---|
A-7 | Deep ResNet | 1 Conv Layer | 1 Transposed-Conv Layer | 14 ResNet Blocks | 32 | 9x32x32 | Vague features |
A-8 | Deep ResNet | 1 Conv Layer | 1 Transposed-Conv Layer | 18 ResNet Blocks | 32 | 9x32x32 | Damaged features with vague colors |
A-9 | Deep ResNet | 2 Conv Layers | 2 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x16x16 | Smooth but vague |
A-10 (128x128) | Deep ResNet | 2 Conv Layers | 2 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x32x32 | Smooth but vague |
After training A-7 to A-9, I found that the networks produced many vague features. I guessed this was caused by insufficient training data, so I doubled the frame size and re-trained A-9 (as A-10) to get a higher-resolution result. However, the result was poor as well. I then replaced some layers with max-pooling to simplify the down-sampling procedure.
Trials | Network Type | Downsample | Upsample | Flat Convolution | Middle Filters | Middle Features | Results |
---|---|---|---|---|---|---|---|
A-11 | Deep ResNet V2 | Conv Layer + MaxPool3D | 2 Transposed-Conv Layers | 14 ResNet Blocks | 32 | 9x16x16 | Smooth but vague |
A-12 | Deep ResNet V2 | MaxPool3D + Conv Layer | 2 Transposed-Conv Layers | 14 ResNet Blocks | 32 | 9x16x16 | Smooth but unstable |
A-13 (128x128) | Deep ResNet V2 | MaxPool3D + 2 Conv Layers | 3 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x8x8 | Incomplete translation |
A-14 (128x128) | Deep ResNet V2 | 3x(MaxPool3D + Flat-Conv) | 3 Transposed-Conv Layers | 14 ResNet Blocks | 64 | 9x8x8 | Smooth features with poor colors |
Due to limited computing resources, I trained Trials A-13 and A-14 for only 30 epochs. Training on 128×128 frames is much slower than on 64×64 frames, so I cannot fully evaluate these models yet. The 128×128 experiments will be continued in the GAN tests.
3.6 Deep ResNet Translator V3, V4, V5-A/B/C
To make down-sampling more efficient, I directly combined the down-sampling layers and the flat convolutional layers. Referring to the design of ResNet-29, I designed expansion blocks to replace the old down-sampling layers. In the V3 design, I also explored convolution along the temporal dimension: in some trials, the temporal stride is 2 in order to compress the temporal feature maps.
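As a small sketch of the idea, here is a generic strided residual block that compresses time and space in one step; it is not necessarily the expansion-block design referenced above, and the channel counts are placeholders:

```python
import torch
import torch.nn as nn

class StridedResBlock3d(nn.Module):
    """Residual block that doubles the channels and halves the temporal and
    spatial resolution with a strided 3x3x3 convolution. The shortcut is
    projected with a strided 1x1x1 convolution so both branches match."""
    def __init__(self, in_ch, out_ch, stride=(2, 2, 2)):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1),
            nn.InstanceNorm3d(out_ch),
        )
        self.shortcut = nn.Conv3d(in_ch, out_ch, 1, stride=stride)

    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))

block = StridedResBlock3d(16, 32, stride=(2, 2, 2))
clip = torch.randn(1, 16, 8, 64, 64)
print(block(clip).shape)  # torch.Size([1, 32, 4, 32, 32]) - time is compressed by 2
```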
Trials | Network Type | Downsample | Upsample | Normalization | Temporal Compress | Results |
---|---|---|---|---|---|---|
A-15 | Deep ResNet V3 | Maxpool + (3+4+6+2) ResNet Blocks | 5 Transposed-Conv Layers | Instance Normalization | No | Dynamic features with poor colors |
A-16 | Deep ResNet V3 | Maxpool + (3+4+6+2) ResNet Blocks | 5 Transposed-Conv Layers | Large batchsize | No | Damaged video |
A-17 | Deep ResNet V3 | Maxpool + (3+4+6+2) ResNet Blocks | 5 Transposed-Conv Layers | Instance Normalization | 1/4 in Middle | Incomplete translation |
A-18 | Deep ResNet V3 | (2+3+4+6+2) ResNet Blocks | 4 Transposed-Conv Layers | Instance Normalization | No | Dynamic features with poor colors |
In these results, the flickering features are very similar to Trials A-1 to A-4, so I suppose that, with more up-sampling layers, feature generation gets out of control because gradients are harder to transfer between layers, and the relation between figures and their locations from the down-sampling stage is lost. That is also why the U-Net shortcut between down-sampling and up-sampling layers works so well for translating related features. To test this idea, I designed another type of residual translator, named ResNet-V4 and ResNet-V5-A/B. With residual connections and U-Net-style shortcuts through the whole network, it seems capable of mapping every level of features between the input and output videos. To demonstrate the performance of this architecture, I conducted several trials with ResNet-V4 and ResNet-V5.
Trials | Network Type | Downsample | Upsample | Depth | Temporal Compress | Results |
---|---|---|---|---|---|---|
A-19 | Deep ResNet V4 | (1+2+3+4+2) ResNet Blocks | (2+4+3+2+1) ResNet Blocks | 50 Layers | No | Dynamic features |
A-20 | Deep ResNet V4 | (1+2+2+2+1) ResNet Blocks | (1+2+2+2+1) ResNet Blocks | 34 Layers | No | Slightly dynamic features |
The ResNet-V5-A/B/C designs and the other experiments in this section will be released after I publish the formal paper or apply for copyright protection.
In ResNet-V5-B, the number of convolutional layers can reach 89 for 128×128 video translation using my newly designed translative structure, so in real applications it can handle more difficult translations and more complex situations.
3.7 Squeeze and Excitation Pipeline
Referring to the Squeeze-and-Excitation structure (https://arxiv.org/pdf/1709.01507.pdf), I tried adding channel-wise gates to every 3D ResNet block in order to increase the capacity of the network. The diagram below shows the Squeeze-and-Excitation pipeline inside a residual block.
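A minimal sketch of such a channel-wise gate for a 3D feature map, assuming the standard reduction-ratio formulation from the SE paper rather than my exact implementation:

```python
import torch
import torch.nn as nn

class SEGate3d(nn.Module):
    """Squeeze-and-Excitation gate for 3D feature maps: global-average-pool each
    channel, pass the channel vector through a small bottleneck MLP, and rescale
    the channels with the resulting sigmoid weights."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)  # squeeze: (B, C, T, H, W) -> (B, C, 1, 1, 1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        w = self.excite(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x * w  # channel-wise re-weighting

# Inside a residual block, the gate would typically wrap the convolutional branch
# before it is added to the identity shortcut:
#   out = identity + se_gate(conv_branch(x))
```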
Squeeze and Excitation Pipeline
4. Conditional Adversarial Training
4.1 Methodology
In this section, I use the stickman input as the network's condition and several translative models as the generator in adversarial training to obtain higher-quality generated video clips. In the previous experiments, I tested many generative and translative models for video clips. Here, I ran many tests on the design of different discriminators to maximize the performance of the generative network.
Conditional GAN (https://arxiv.org/pdf/1411.1784.pdf)
In the original design, the generator has two types of inputs: noise data (z) and a condition (y). But in my experiment, which is designed to translate or generate video from an instructional video clip, the noise input seems useless to the whole structure, since it is originally meant to map features and diversify the generator's output. So, in this case, the noise data can be dropped when I aim to generate only a single character.
Illustration of Minimax Game in C-GAN
4.2 Discriminator and Training
Following the conditional GAN, the discriminator contains a matrix concatenation step that combines the object to be discriminated and the condition as the discriminator's input. After the concatenation, like a normal GAN, I use several convolutional layers ending with a sigmoid function, and I train the network with a binary cross-entropy loss, using zeros to label fake video and ones to label real video.
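Below is a minimal sketch of this conditional discriminator and one training step. The layer counts and tensor shapes are placeholders, and the fake clip would normally come from the generator; it illustrates the concatenation and the BCE labels rather than the exact B-1 to B-5 configurations.

```python
import torch
import torch.nn as nn

class CondVideoDiscriminator(nn.Module):
    """Concatenate the (real or fake) clip with the condition clip channel-wise,
    then score it with a few strided 3D convolutions and a sigmoid."""
    def __init__(self, base=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(6, base, 4, stride=(1, 2, 2), padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(base * 4, 1),
            nn.Sigmoid(),
        )

    def forward(self, video, condition):
        return self.net(torch.cat([video, condition], dim=1))

# One discriminator step with BCE labels (ones = real, zeros = fake).
disc = CondVideoDiscriminator()
bce = nn.BCELoss()
condition = torch.randn(2, 3, 8, 64, 64)   # e.g. stickman clip
real = torch.randn(2, 3, 8, 64, 64)        # e.g. target character clip
fake = torch.randn(2, 3, 8, 64, 64)        # generator(condition).detach() in practice

d_loss = bce(disc(real, condition), torch.ones(2, 1)) + \
         bce(disc(fake, condition), torch.zeros(2, 1))
d_loss.backward()
```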
Binary Cross Entropy
Trials | Discriminator | Concatenation | Temporal Compress | Generator | Normalization | Balance | Result |
---|---|---|---|---|---|---|---|
B-1 | Concatenate + 4 Conv Layers | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-34 (Trial A-9) | Instance Normalization | Discriminator is stronger | Smooth but slightly dynamic |
B-2 | Concatenate + 5 Conv Layers | Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-34 (Trial A-9) | Instance Normalization | Pending | Pending |
B-3 | Concatenate + 4 Conv Layers | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-V4-50 (Trial A-19) | Batch Normalization | Discriminator is stronger | Damaged features |
B-4 | Concatenate + 4 Conv Layers | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Layers | ResNet-V5-A-40 | Instance Normalization | Discriminator is slightly stronger | Smooth but slightly dynamic |
B-5 | Concatenate + ResNet-11 | 1 Conv Layer + Channel-wise Concatenate | 1/4 in Last Two Blocks | ResNet-V5-B-71 | Instance Normalization | Discriminator is stronger | Smooth but slightly dynamic |
4.3 Different Generators
In this section, I conduct several trials using the translator models from the previous tests and evaluate their performance in adversarial training.
Trials | Generator | Generator Type | Discriminator | Results |
---|---|---|---|---|
B-6 | Trial A-1 | Bottleneck-22 | 5 Conv Layers with Direct Concatenation | Dynamic features |
B-7 | Trial A-6 | U-Net-22 | 5 Conv Layers with Direct Concatenation | Pending |
B-8 | Trial A-9 | ResNet-V1-32 | 5 Conv Layers with Direct Concatenation | Smooth but with slightly-dynamic features |
4.4 Deep Networks and Mixed Training
The ResNet-V5-A/B/C designs and the other experiments in this section will be released after I publish the formal paper or apply for copyright protection.
Trials | Generator | Depth | Discriminator | Training | Result |
---|---|---|---|---|---|
B-9 | ResNet-V5-A | 40 Layers | 6 Conv Layers | C-GAN Loss | Slightly damaged features |
B-10 | ResNet-V5-B | 71 Layers | 6 Conv Layers | C-GAN Loss | - |
B-11 (128x128) | ResNet-V5-B | 89 Layers | ResNet-11 | C-GAN Loss | - |
B-12 | ResNet-V5-C1 | 45 Layers | 6 Conv Layers | C-GAN Loss | - |
B-13 | ResNet-V5-C2 | 54 Layers | ResNet-11 | Mixed Loss | - |
B-14 | ResNet-V5-C3 | 60 Layers | 6 Conv Layers | Mixed Loss | - |
B-15 | ResNet-V5-C4 | 68 Layers | 6 Conv Layers | Mixed Loss | - |
5. Environment
All the networks are trained on a laptop with a GTX 960M and on Google Compute Engine with a Tesla K80 or Tesla P100, depending on the required training time. All the code and sampling jobs are run on macOS 10.13 and Ubuntu 16.04 LTS.
5.1 Miku Datasets
I prepared three datasets for dynamic object translation: a small Miku-to-Miku dataset, a small Stick-to-Miku dataset, and a full Stick-to-Miku dataset. The first two datasets contain approximately 3,000 image pairs each, and the full dataset contains 10,000 image pairs. The Miku-to-Miku dataset contains an anime character dancing in two dressing styles, to test the model's translation performance. The Stick-to-Miku datasets contain the same dances performed by a stickman and by Miku separately, to simulate conditional generation of a dynamic object from movement data. All the video frames are generated with the MikuMikuDance software, and the character models' authors are listed in the references.
Data Sample Preview: Stickman (Left), Electric Miku (Middle), TDA Miku (Right)
5.2 Horse-to-Zebra Dataset
5.3 Video Experiment Platform
To conduct these experiments and implement the different models, I built a video experiment platform that can split videos, generate datasets, train different networks, and render an output video side-by-side with the original data. I also designed a frame-traverse slider that can produce arbitrary sequences of frames as the network's input. I will release the platform for research use after I publish my formal paper.
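The following is a hypothetical sketch of such a sliding-window sampler; the class name, window length, and stride are placeholders, and the released platform may work differently:

```python
import torch
from torch.utils.data import Dataset

class FrameSliderDataset(Dataset):
    """Hypothetical sliding-window sampler: given a stack of frame tensors,
    return every contiguous clip of `window` frames, stepping by `stride`."""
    def __init__(self, frames, window=9, stride=1):
        # frames: tensor of shape (num_frames, channels, height, width)
        self.frames = frames
        self.window = window
        self.stride = stride

    def __len__(self):
        return max(0, (len(self.frames) - self.window) // self.stride + 1)

    def __getitem__(self, idx):
        start = idx * self.stride
        clip = self.frames[start:start + self.window]  # (window, C, H, W)
        return clip.permute(1, 0, 2, 3)                # -> (C, window, H, W) for Conv3d

# Example: 100 fake frames of 3x64x64 give 92 overlapping 9-frame clips.
frames = torch.randn(100, 3, 64, 64)
dataset = FrameSliderDataset(frames, window=9, stride=1)
print(len(dataset), dataset[0].shape)  # 92 torch.Size([3, 9, 64, 64])
```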
During the experiments, the Python data-handling code was slow, so I optimized the allocation of GPU and CPU work by reducing algorithmic complexity and increasing the batch size, so that the trainer keeps the GPU busy with minimal gaps caused by slow "for" loops running on the CPU. In my tests, the old version of the frame slider reached 96% GPU utilization, while the new slider keeps the GPU fully occupied. On average, the training speed on the laptop with the GTX 960M increased by about 30%.
6. Limitations
Although this new architecture works better than the traditional U-Net, the generated videos are still very easy for a human to distinguish from real ones, mainly because of vague textures. I believe that, as the frame size and the network's scale increase, this architecture can be applied to more scenarios.
7. Results
Architecture | Train | Discriminator | Loss | Score (low is better) |
---|---|---|---|---|
5-level U-Net (baseline) | Direct | – | 1.1488 | 0 |
4-level U-Net | Direct | – | 1.1999 | +0.0511 |
Conv-Deconv-1 | Direct | – | 1.1739 | +0.0251 |
Conv-Deconv-2 | Direct | – | 1.1419 | -0.0069 |
Conv-Deconv-3 | Direct | – | 1.1425 | -0.0063 |
Conv-Deconv-4 | Direct | – | 1.1410 | -0.0078 |
Conv-Deconv-5 | C-GAN | Conv | 1.1436 | -0.0052 |
Residual U-Net | C-GAN | Conv | 1.1432 | -0.0056 |
ResNet-V5-C | C-GAN | Conv | 1.1352 | -0.0136 |
8. Conclusions
To sum up, with the use of 3D convolution, the generative video model performs well in these translation tasks, and the residual learning framework significantly improves on the traditional U-Net.