【1】Zhou et al. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. AAAI 2021 Best Paper.
Long sequence time-series forecasting (LSTF) demands high predictive capacity from a model, i.e., the ability to efficiently capture precise long-range dependencies between output and input. The Transformer has this potential but still suffers from serious problems: quadratic time complexity, high memory usage, and inherent limitations of the encoder-decoder architecture. To address these issues, the paper proposes Informer, a new LSTF model.
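As a reminder of where the quadratic cost comes from, here is a minimal PyTorch sketch of vanilla scaled dot-product attention (not Informer's ProbSparse variant):

```python
import torch

def full_attention(q, k, v):
    # Vanilla scaled dot-product attention: the (L, L) score matrix is
    # what gives the Transformer its quadratic time/memory cost in the
    # sequence length L -- the bottleneck Informer sets out to avoid.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # shape (L, L): O(L^2)
    return torch.softmax(scores, dim=-1) @ v

L, d = 4096, 64
q = k = v = torch.randn(L, d)
out = full_attention(q, k, v)   # materializes a 4096 x 4096 score matrix
```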
【2】Radford et al. Improving Language Understanding by Generative Pre-Training. arXiv 2018. (GPT)
This paper explores GPT, a semi-supervised model for NLP tasks that combines unsupervised pre-training with supervised fine-tuning. Concretely, it uses a 12-layer unidirectional Transformer, trained with the pre-train + fine-tune recipe.
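The unsupervised stage maximizes the standard left-to-right language-modelling likelihood over a token corpus U = {u_1, ..., u_n}, with context window k and Transformer parameters Θ:

```latex
L_1(\mathcal{U}) = \sum_{i} \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)
```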
【3】Radford et al. Language Models Are Unsupervised Multitask Learners. arXiv 2019. (GPT-2)
This paper presents GPT-2, proposes a meta-learning style paradigm, and contributes WebText, a new 40 GB large-scale dataset used to train GPT-2.
【4】Brown et al. Language Models are Few-Shot Learners. arXiv 2020. (GPT-3)
This paper presents GPT-3, a model with 175 billion parameters, more than ten times the capacity of previous approaches. GPT-3's main goal is to solve tasks with less in-domain data and without a fine-tuning step.
【5】Wu et al. CvT: Introducing Convolutions to Vision Transformers. arXiv 2021.
【6】Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
【7】Yun et al. Are Transformers Universal Approximators of Sequence-to-Sequence Functions? ICLR 2020.
【8】Yun et al. O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers. NIPS 2020.
【9】Wang et al. Exploring Font-independent Features for Scene Text Recognition. ACM MM 2020.
This paper points out that existing frameworks have no explicit mechanism to guarantee that font-style information is removed, which limits their generalization ability, so current methods perform poorly on samples with new font styles. The paper uses a GAN to transform the CNN features of multiple fonts into font skeletons (glyphs). The GAN is guided by the glimpse at each time step, used as a font embedding, in order to learn font-independent features.
【10】Mou et al. PlugNet: Degradation Aware Scene Text Recognition Supervised by a Pluggable Super-Resolution Unit. ECCV 2020.
To address the difficulty of recognizing text in heavily blurred or low-resolution scenes, this paper proposes a pluggable super-resolution unit. Traditional super-resolution works at the image level (e.g., ESRGAN-Aster), whose obvious drawback is that it is very time-consuming, so the authors study super-resolution at the feature level instead. Concretely, they propose FSM (Feature Squeeze Module) to preserve more positional information, i.e., a 1x1 convolution plus a reshape layer that produces 1-D vectors, and FEM (Feature Enhancement Module) to combine low-to-high features, i.e., FPN-like multi-level feature fusion.
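A minimal sketch of what an FSM-style squeeze could look like (channel sizes and shapes are assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class FeatureSqueeze(nn.Module):
    """FSM-style squeeze: a 1x1 conv reduces channels, then the feature
    map is reshaped into per-column 1-D vectors for the sequence decoder,
    keeping spatial position information instead of pooling it away."""
    def __init__(self, in_ch=512, out_ch=64):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):            # x: (B, C, H, W)
        x = self.conv1x1(x)          # (B, out_ch, H, W)
        b, c, h, w = x.shape
        return x.permute(0, 3, 1, 2).reshape(b, w, c * h)  # (B, W, C*H)

feat = torch.randn(2, 512, 8, 32)
seq = FeatureSqueeze()(feat)         # (2, 32, 512): one vector per column
```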
【11】Fang et al. Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition. CVPR 2021.
【12】Aberdam et al. Sequence-to-Sequence Contrastive Learning for Text Recognition. arXiv 2020.
【13】Yang et al. Convolutional Prototype Network for Open Set Recognition. PAMI 2021.
【14】Qiao et al. Gaussian Constrained Attention Network for Scene Text Recognition. ICPR 2020.
【15】Zhang et al. FedOCR: Communication-Efficient Federated Learning for Scene Text Recognition. arXiv 2020.
【16】Yousef et al. OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by Learning to Unfold. CVPR 2020.
【17】Baek et al. What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition with Fewer Labels. CVPR 2021.
The motivation is that STR has an implicit convention of training models on large synthetic datasets, which currently has two problems: first, generating synthetic data for some target domains, e.g., handwritten and artistic text, is difficult; second, synthetic data for non-English languages is still very scarce. The paper therefore explores an STR model trained only on real labels. The STR framework is the four-stage model of Baek et al., ICCV 2019, combined with two learning methods that exploit real labels: 1) semi-supervised learning, via pseudo-labelling and Mean Teacher; 2) self-supervised learning, in the style of RotNet and MoCo (a minimal pseudo-labelling sketch follows below).
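As a reminder of the pseudo-labelling half of the recipe (a generic confidence-thresholding sketch, not the paper's code; the shapes and the 0.9 threshold are assumptions):

```python
import torch

@torch.no_grad()
def make_pseudo_labels(model, unlabeled_images, threshold=0.9):
    """Generic confidence-based pseudo-labelling: keep a prediction as a
    training label only if the model's mean per-character confidence
    clears a threshold (0.9 here is an arbitrary choice)."""
    model.eval()
    logits = model(unlabeled_images)              # (B, T, num_classes)
    probs = logits.softmax(dim=-1)
    conf, labels = probs.max(dim=-1)              # (B, T)
    keep = conf.mean(dim=1) >= threshold          # (B,)
    return unlabeled_images[keep], labels[keep]
```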
【18】Zhang et al. SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition. AAAI 2021.
This paper proposes the Structure-Preserving Inner Offset Network (SPIN) to address chromatic distortion. The authors divide chromatic distortion into two types: 1) the inter-pattern problem, where the intensities of text and background are close and hard to separate (poor contrast); 2) the intra-pattern problem, where external noise such as shadows and occluders interferes. To solve chromatic distortion, the authors extend MORAN's offsets and apply intensity offsets over channels. SPIN consists of two sub-modules: a Structure-Preserving Network (drawing on the Structure-Preserving Transformation (SPT) of Peng, Zheng, and Zhang 2019) and an Auxiliary Inner-offset Network (a weighted sum of the input image x and a spatially offset image x' yields an updated image, which enters SPT as the channel offset).
【19】Peng et al. Structure-Preserving Transformation: Generating Diverse and Transferable Adversarial Examples. AAAI 2019 Workshop.
The structure-preserving transformation here is really a singleton-based transform: each pixel undergoes a learnable single-valued power transform, roughly of the form g(x) = x^γ with a learnable exponent γ. The authors point out that the benefit is that the structural patterns of the original image are preserved.
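A minimal sketch of such a learnable power transform, assuming inputs normalized to [0, 1] (illustrative, not the paper's exact parameterization):

```python
import torch
import torch.nn as nn

class LearnablePowerTransform(nn.Module):
    """Applies g(x) = x**gamma element-wise with a learnable exponent.
    A monotonic power curve reshapes intensities but keeps the ordering
    of pixel values, hence 'structure-preserving'."""
    def __init__(self, init_gamma=1.0):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.tensor(init_gamma).log())

    def forward(self, x):                  # x in [0, 1]
        return x.clamp(min=1e-6) ** self.log_gamma.exp()

img = torch.rand(1, 3, 32, 100)
out = LearnablePowerTransform(init_gamma=0.5)(img)
```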
【20】Aberdam et al. Sequence-to-Sequence Contrastive Learning for Text Recognition. arXiv 2020.
【21】Liao et al. Real-time Scene Text Detection with Differentiable Binarization. AAAI 2020.
This paper tackles the non-differentiable and cumbersome post-processing step of text detection; the idea is to learn a threshold map and use a sigmoid to approximate the step function of binarization.
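Concretely, DB's differentiable binarization is (P the probability map, T the learned threshold map, k an amplifying factor, set to 50 in the paper):

```latex
\hat{B}_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}
```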
【22】Wan et al. TextScanner: Reading Characters in Order for Robust Scene Text Recognition. AAAI 2020.
The idea is order segmentation + localization map + character segmentation; the highlight is learning the reading order of the characters, using segmentation and mutual supervision.
【23】Hu et al. GTC: Guided Training of CTC Towards Efficient and Accurate Scene Text Recognition. AAAI 2020.
The idea is attentional guidance plus a GCN to build correlations among features (node classification with aggregation based on similarity projection and a distance matrix).
【24】Hu et al. Accurate Structured-Text Spotting for Arithmetical Exercise Correction. AAAI 2020.
【25】Wang et al. All You Need Is Boundary: Toward Arbitrary-Shaped Text Spotting. AAAI 2020.
【26】Lai et al. SynSig2Vec: Learning Representations from Synthetic Dynamic Signatures for Real-world Verification. AAAI 2020.
【27】Li et al. FET-GAN: Font and Effect Transfer via K-shot Adaptive Instance Normalization. AAAI 2020.
【28】Kipf et al. Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017.
【29】Velickovic et al. Graph Attention Networks. ICLR 2018.
【30】Wang et al. Decoupled Attention Network for Text Recognition. AAAI 2020.
The idea is a framework that decouples character localization from decoding; the character localization module is an FCN.
【31】Gao et al. GAN-Based Unpaired Chinese Character Image Translation via Skeleton Transformation and Stroke Rendering. AAAI 2020.
【32】Tang et al. SegLink++: Detecting Dense and Arbitrary-shaped Scene Text by Instance-aware Component Grouping. PR 2019.
【33】Liu et al. ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network. CVPR 2020.
【34】Bruna et al. Spectral Networks and Locally Connected Networks on Graphs. arXiv 2013.
【35】Defferrard et al. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. NIPS 2016.
【36】Hamilton et al. Inductive Representation Learning on Large Graphs. NIPS 2017.
【37】Qiao et al. Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting. AAAI 2020.
The idea is order-aware segmentation for detection plus fiducial-point fine-tuning.
【38】Busta et al. Deep TextSpotter: An End-to-End Trainable Scene Text Localization and Recognition Framework. ICCV 2017.
【39】Mali et al. ScanSSD: Scanning Single Shot Detector for Mathematical Formulas in PDF Document Images. arXiv 2020.
【40】Islam et al. How Much Position Information Do Convolutional Neural Networks Encode? ICLR 2020.
【41】Wang et al. SOLO: Segmenting Objects by Locations. ECCV 2020.
The idea is a category branch + a mask branch, and the paper introduces the concept of an instance category (a sketch of the two heads follows below).
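A sketch of the two parallel heads with illustrative shapes (grid size, channel counts, and the omitted coordinate features are simplifications, not the paper's code):

```python
import torch
import torch.nn as nn

class SOLOHeads(nn.Module):
    """Sketch of SOLO's two heads: the image is divided into an S x S grid;
    the category branch classifies each grid cell, and the mask branch
    predicts one full-image mask channel per cell, so the cell index itself
    encodes which instance a mask belongs to (the 'instance category')."""
    def __init__(self, in_ch=256, num_classes=80, grid=40):
        super().__init__()
        self.grid = grid
        self.cate_head = nn.Conv2d(in_ch, num_classes, 3, padding=1)
        self.mask_head = nn.Conv2d(in_ch, grid * grid, 1)

    def forward(self, feat):                        # feat: (B, C, H, W)
        cate = self.cate_head(                      # (B, num_classes, S, S)
            nn.functional.interpolate(feat, size=(self.grid, self.grid)))
        masks = self.mask_head(feat)                # (B, S*S, H, W)
        return cate, masks
```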
【42】Xu et al. LayoutLM: Pre-training of Text and Layout for Document Understanding. arXiv 2020.
【43】Wang et al. Linkage Based Face Clustering via Graph Convolution Network. CVPR 2019.
The idea is to build node-centered subgraphs and convert the clustering task into a linkage-based (link prediction) problem.
【44】Zhang et al. Efficient Backbone Search for Scene Text Recognition. arXiv 2020.
【45】Zhang et al. Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection. CVPR 2020.
【46】Yang et al. SwapText: Image Based Texts Transfer in Scenes. CVPR 2020.
【47】Feng et al. Scene Text Recognition via Transformer. arXiv 2020.
【48】Ma et al. ReLaText: Exploiting Visual Relationships for Arbitrary-Shaped Scene Text Detection with Graph Convolutional Networks. arXiv 2020.
【49】Litman et al. SCATTER: Selective Context Attentional Scene Text Recognizer. CVPR 2020.
【50】Long et al. UnrealText: Synthesizing Realistic Scene Text Images From the Unreal World. CVPR 2020.
【51】Chen et al. Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. CVPR 2020.
【52】Baek et al. CRAFT: Character Region Awareness for Text Detection. CVPR 2019.
The paper uses joint weakly-supervised training on synthetic and real samples and introduces the concept of character affinity.
【53】Tarvainen et al. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. NIPS 2017.
Mean Teacher mainly targets a prominent problem of the Temporal Ensembling model: information from unlabeled data is folded into the model only at the next epoch. This causes two issues: 1) on large datasets, the model updates slowly; 2) online training of the model is impossible. The core idea of Mean Teacher is therefore that the model acts as both student and teacher: as a teacher, it produces the targets that the student learns from; as a student, it learns from the targets produced by the teacher model. The teacher's parameters are obtained as a weighted (exponential moving) average of the student's parameters over the preceding steps (see the EMA sketch below).
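A minimal PyTorch sketch of that update (alpha = 0.99 is an illustrative smoothing coefficient):

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, alpha=0.99):
    """Mean Teacher EMA update: teacher weights become a weighted average
    of their previous value and the current student weights, i.e.
    theta_t <- alpha * theta_t + (1 - alpha) * theta_s, applied each step."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1 - alpha)
```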
【54】He et al. Momentum Contrast for Unsupervised Visual Representation Learning. CVPR 2020.
【55】Slossberg et al. On Calibration of Scene Text Recognition Models. arXiv 2020.
【56】Xie et al. Aggregation Cross Entropy for Sequence Recognition. CVPR 2019.
To understand the principle more completely, and to explore solutions to its scale-imbalance problem, I re-read the ACE paper. ACE points out that a key difficulty in sequence recognition is that the alignment between the l-th character of the sequence and the model prediction y_k^t is unknown (the alignment problem), which makes it hard to compute a prediction-probability loss for each character. The ACE idea is instead to compute a cross-entropy between the summed (aggregated over time) distribution of y_k^t and the aggregated count distribution of the label, which sidesteps the alignment problem of per-character losses, because the supervision only needs the cumulative frequency of each class in the sequence (a minimal sketch of the loss follows below).
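A minimal sketch of the ACE loss as I understand it (shapes and the toy label counts are illustrative):

```python
import torch

def ace_loss(probs, label_counts):
    """Aggregation Cross-Entropy sketch.
    probs:        (T, C) per-timestep class probabilities (softmax output).
    label_counts: (C,) count of each class in the label, with the blank
                  class count set to T - len(label) so counts sum to T.
    The loss compares the two aggregated count distributions, so no
    per-character alignment is needed."""
    T = probs.size(0)
    pred_dist = probs.sum(dim=0) / T          # aggregated prediction distribution
    label_dist = label_counts / T             # aggregated label distribution
    return -(label_dist * pred_dist.clamp(min=1e-10).log()).sum()

T, C = 25, 37                                  # e.g. 36 characters + blank
probs = torch.softmax(torch.randn(T, C), dim=-1)
counts = torch.zeros(C)
counts[0] = T - 3                              # blank fills the unused steps
counts[5] = 2                                  # hypothetical 3-char label with
counts[9] = 1                                  # one repeated character
print(ace_loss(probs, counts))
```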
【57】Nguyen et al. Dictionary-guided Scene Text Recognition. CVPR 2021.
This paper couples visual features with the dictionary inside the recognition model, instead of merely computing edit distances on the final output. Concretely, the histogram of an L loss between the feature map v and the words of a dictionary list is pushed to match the histogram of edit distances between the ground truth y and the same dictionary list; the discrepancy between the two histogram distributions is measured with a KL divergence, and the L loss itself is computed from the attention-based probability matrix P. In addition, the paper contributes a new Vietnamese scene-text dataset containing 2,000 fully annotated images (a rough sketch of the histogram-matching idea follows below).
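A rough, heavily hedged sketch of matching the two dictionary histograms with a KL divergence (the softmax-with-temperature conversion and `tau` are my assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def histogram_kl(model_scores, edit_distances, tau=1.0):
    """model_scores:   (D,) loss of the visual features against each of the
                       D dictionary words.
       edit_distances: (D,) edit distance of the ground truth to each word.
    Both are turned into distributions over the dictionary (lower score /
    distance = higher mass), then compared with a KL divergence."""
    pred = F.log_softmax(-model_scores / tau, dim=0)
    target = F.softmax(-edit_distances / tau, dim=0)
    return F.kl_div(pred, target, reduction="sum")

scores = torch.rand(100)                       # hypothetical 100-word dictionary
dists = torch.randint(0, 10, (100,)).float()
print(histogram_kl(scores, dists))
```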
【59】Bhunia et al. Text is Text, No Matter What: Unifying Text Recognition using Knowledge Distillation. ICCV 2021.
This paper uses knowledge distillation to design a single model unified over scene text and handwritten text. Concretely, two specialised teachers are used, one pre-trained on scene-text data and the other pre-trained on handwritten-text data, to teach one unified student model (a sketch of the distillation term follows below).
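A reminder of the standard (Hinton-style) distillation term such a setup could be trained with; the temperature T and the two-teacher combination in the comment are illustrative assumptions, not the paper's exact losses:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=4.0):
    """Standard soft-label distillation: the student's temperature-softened
    predictions are pulled toward the teacher's softened predictions."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T

# Hypothetical usage with the two specialised teachers:
# loss = distill_loss(student(x_scene), scene_teacher(x_scene)) \
#      + distill_loss(student(x_hw), handwriting_teacher(x_hw))
```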