1. To the best of our knowledge
2. In order to bridge this gap
3. It is worth mentioning that
4. It should be emphasized that
5. Comparisons with the state of the art
1.Deep Convolutional Neural Networks (CNNs) are powerful models that have achieved excellent performance on difficult computer vision tasks.
2. Although CNNs perform well whenever large labeled training samples are available, they perform poorly on video frame synthesis due to object deformation and motion, scene lighting changes, and camera motion in video sequences.
3. In this paper, we present a novel and general end-to-end architecture, called the convolutional Transformer (ConvTransformer), for video frame sequence learning and video frame synthesis.
1. Video frame synthesis, aiming to synthesize spatially and temporally coherent intermediate frames between two consecutive real frames or to synthesize the future frames of a frame sequence, is a classical and fundamental problem in the video processing and computer vision community.
2. The abrupt motion artifacts and temporal aliasing in video sequences can be suppressed with the help of video frame synthesis, and hence it can be applied to numerous applications ranging from motion deblurring [5] to video frame rate up-sampling [7, 3], video editing [23, 37], novel view synthesis [9], and autonomous vehicles [34].
3. Although these methods perform well when optical flow is accurately estimated, they generate motion blur and artifacts when the optical flow estimation is inaccurate.
4. In order to bridge this gap, in this work we propose a general end-to-end video frame synthesis network, the convolutional Transformer (ConvTransformer), which recasts video frame synthesis as an encoder-decoder problem (see the sketch after this list).
5.The main contributions of this paper are therefore as follows.
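To make the encoder-decoder framing above concrete, here is a minimal PyTorch sketch of a convolutional Transformer for frame synthesis: a shared CNN encodes each frame, self-attention runs along the temporal axis at every spatial location, and a deconvolutional decoder synthesizes the target frame. This is an illustrative sketch only; the module names, layer sizes, and pooling choice (FrameEncoder, ConvTransformerSketch, etc.) are assumptions, not the paper's released code.

```python
# Hedged sketch of a convolutional encoder-decoder Transformer for frame
# synthesis. All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Shared CNN: maps each RGB frame to a downsampled feature map."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class ConvTransformerSketch(nn.Module):
    """Encode a frame sequence, attend along time, decode one frame."""
    def __init__(self, dim=64, heads=4, layers=2):
        super().__init__()
        self.encoder = FrameEncoder(dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=layers)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 4, stride=2, padding=1),
        )

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1))  # (B*T, D, h, w)
        d, h, w = feats.shape[1:]
        # Treat every spatial location as one token sequence of length T,
        # so self-attention fuses information along the time axis.
        seq = feats.view(b, t, d, h * w).permute(0, 3, 1, 2)
        seq = seq.reshape(b * h * w, t, d)
        fused = self.temporal(seq).mean(dim=1)      # pool over time
        fused = fused.view(b, h * w, d).permute(0, 2, 1).reshape(b, d, h, w)
        return self.decoder(fused)                  # (B, 3, H, W)

# Usage: predict one frame from a 4-frame clip (interpolation and
# extrapolation differ only in which context frames are fed in).
clip = torch.randn(2, 4, 3, 64, 64)
print(ConvTransformerSketch()(clip).shape)         # torch.Size([2, 3, 64, 64])
```

The design choice worth noting is that attention runs only over the T frames at each spatial position, which is what lets the model capture long-term temporal dependence while convolutions keep the spatial structure.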
1. Although these methods achieve high-quality results, they suffer from heavy computation and are sensitive to large motion.
2. In order to overcome this issue, a convolutional Transformer (ConvTransformer) is proposed in this work and successfully applied to video frame synthesis. The experimental results show that the simplified ConvTransformer architecture achieves competitive results compared to well-designed state-of-the-art networks such as MCNet, DAIN, and BMBC. To the best of our knowledge, this is the first time the ConvTransformer architecture has been proposed.
1. For a fair comparison, we implemented and retrained these methods with the same training set used to train ConvTransformer.
2. In order to evaluate and justify the efficiency and superiority of each part of the proposed ConvTransformer architecture, several ablation experiments were conducted in this work.
1. In this work, we propose a novel video frame synthesis architecture, ConvTransformer, which not only works well on video frame extrapolation but also interpolates photo-realistic intermediate frames.
2. Extensive quantitative and qualitative evaluations indicate that the proposed ConvTransformer performs favorably against existing frame extrapolation and interpolation methods.
3. The successful implementation of ConvTransformer sheds light on applying it to other video tasks that need to exploit long-term sequential dependence in video.