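The document page is treated as a coordinate system with its origin at the top-left corner. Two embedding tables are used to build four position embeddings, one table for the horizontal axis and one for the vertical axis;
The page image is split into a sequence of small image pieces, and whole-image features extracted with Faster R-CNN are used to enrich the [CLS] token.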
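Pre-training set: IIT-CDIP Test Collection 1.0 (more than 6 million scanned documents with more than 11 million scanned page images, including letters, emails, forms, invoices, etc.).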
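Among the tokens selected for masking, 10% are replaced with a random token and the remaining 10% are left unchanged. The 9 embeddings are summed and passed through layer norm to form the input to the first layer: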
    embeddings = (
        words_embeddings
        + position_embeddings
        + left_position_embeddings
        + upper_position_embeddings
        + right_position_embeddings
        + lower_position_embeddings
        + h_position_embeddings
        + w_position_embeddings
        + token_type_embeddings
    )
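
To make the summation concrete, here is a minimal pure-Python sketch of how the nine embeddings might be looked up, summed, and layer-normalized for a single token. The toy hidden size, the table sizes, the coordinate range, the helper names (`make_table`, `embed_token`, `layer_norm`), and the random initialisation are all assumptions for illustration; the real model uses learned embedding layers and a layer norm with learnable gain and bias.

```python
import math
import random

random.seed(0)
HIDDEN = 8        # toy hidden size (assumption; real models are much larger)
MAX_COORD = 1001  # assumes integer coordinates in [0, 1000]

def make_table(rows, dim=HIDDEN):
    """Random lookup table standing in for a learned embedding layer."""
    return [[random.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]

# One table per axis: the x table serves left/right, the y table upper/lower.
x_table, y_table = make_table(MAX_COORD), make_table(MAX_COORD)
# Separate height/width tables, mirroring h_/w_position_embeddings above.
h_table, w_table = make_table(MAX_COORD), make_table(MAX_COORD)
word_table, pos_table, type_table = make_table(100), make_table(512), make_table(2)

def layer_norm(vec, eps=1e-12):
    """Plain layer norm without learnable gain/bias."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def embed_token(word_id, position, box, token_type=0):
    """box = (left, upper, right, lower) in normalized integer coordinates."""
    left, upper, right, lower = box
    parts = [
        word_table[word_id],       # words_embeddings
        pos_table[position],       # position_embeddings (1D)
        x_table[left],             # left_position_embeddings
        y_table[upper],            # upper_position_embeddings
        x_table[right],            # right_position_embeddings
        y_table[lower],            # lower_position_embeddings
        h_table[lower - upper],    # h_position_embeddings
        w_table[right - left],     # w_position_embeddings
        type_table[token_type],    # token_type_embeddings
    ]
    summed = [sum(column) for column in zip(*parts)]  # element-wise sum of 9 vectors
    return layer_norm(summed)

vec = embed_token(word_id=5, position=0, box=(10, 20, 110, 40))
```

Sharing `x_table` between the left and right lookups (and `y_table` between upper and lower) is what lets two tables produce four position embeddings.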
Differences from LayoutLM:
$$\bm t_i=\text{TokEmb}(w_i) + \text{PosEmb1D}(i) + \text{SegEmb}(s_i)$$
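The image is resized to 224x224 and fed into a ResNeXt-FPN encoder (whose parameters are updated during pre-training); the output is average-pooled into a $W \times H$ feature map (3-D) and then flattened into a 2-D sequence;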
$$\bm v_i=\text{Proj}(\text{VisTokEmb}(I)_i) + \text{PosEmb1D}(i) + \text{SegEmb}([\text{C}]),\quad 0\leq i < WH$$
Coordinates are normalized and discretized to $[0, 1000]$, and the $x$ and $y$ coordinates each use their own embedding layer. For a bounding box $\text{box}_i=(x_{min},x_{max},y_{min},y_{max},w,h)$,
$$\bm l_i=\text{Concat}(\text{PosEmb2D}_x(x_{min},x_{max},w),\text{PosEmb2D}_y(y_{min},y_{max},h))$$
An empty box $\text{box}_\text{PAD}=(0,0,0,0,0,0)$ is used for special tokens such as [CLS].
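
The layout embedding can be sketched in a few lines of pure Python. This is only an illustration: the coordinate range of 0 to 1000, the toy embedding size, the pixel-space input order `(x_min, y_min, x_max, y_max)`, and the helper names `normalize_box` and `layout_embedding` are assumptions, and the random tables stand in for learned embedding layers.

```python
import random

random.seed(0)
SCALE = 1000  # assumed normalized coordinate range [0, SCALE]
DIM = 4       # toy size of each per-coordinate embedding (assumption)

def make_table(rows, dim=DIM):
    """Random lookup table standing in for a learned embedding layer."""
    return [[random.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]

x_table = make_table(SCALE + 1)  # PosEmb2D_x: used for x_min, x_max and w
y_table = make_table(SCALE + 1)  # PosEmb2D_y: used for y_min, y_max and h

def normalize_box(pixel_box, page_width, page_height, scale=SCALE):
    """Map a pixel box (x_min, y_min, x_max, y_max) to integer coordinates
    and return (x_min, x_max, y_min, y_max, w, h) as in the text."""
    px0, py0, px1, py1 = pixel_box
    x_min = scale * px0 // page_width
    x_max = scale * px1 // page_width
    y_min = scale * py0 // page_height
    y_max = scale * py1 // page_height
    return (x_min, x_max, y_min, y_max, x_max - x_min, y_max - y_min)

def layout_embedding(box):
    """Concatenate the six coordinate embeddings into one layout vector."""
    x_min, x_max, y_min, y_max, w, h = box
    return (x_table[x_min] + x_table[x_max] + x_table[w]      # list concatenation
            + y_table[y_min] + y_table[y_max] + y_table[h])

BOX_PAD = (0, 0, 0, 0, 0, 0)  # empty box for special tokens

box = normalize_box((100, 50, 300, 100), page_width=1000, page_height=500)
l_i = layout_embedding(box)
l_pad = layout_embedding(BOX_PAD)
```

Because the six parts are concatenated rather than summed, each per-coordinate embedding only needs one sixth of the hidden size.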
$$\bm x_i^{(0)}=X_i+\bm l_i,\quad X=\{\bm v_0,...,\bm v_{WH-1}, \bm t_0, ...,\bm t_{L-1}\}$$
$$\alpha_{ij}=\frac{1}{\sqrt{d_{head}}}(\bm x_i\bm W^Q)(\bm x_j\bm W^K)^{\top},\quad \alpha'_{ij}=\alpha_{ij}+\bm b_{j-i}^{1D}+\bm b_{x_j-x_i}^{2D_x}+\bm b_{y_j-y_i}^{2D_y},\quad \bm h_i=\sum_j\frac{\exp(\alpha'_{ij})}{\sum_k\exp(\alpha'_{ik})}\bm x_j\bm W^V$$
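
A single attention head with these three bias terms can be sketched as follows. The sketch assumes the queries, keys, and values have already been projected, represents each learned bias table as a plain dictionary from relative offset to scalar (missing offsets count as 0), and takes one `(x, y)` anchor per token (for example the top-left corner of its box). The function name and argument layout are hypothetical.

```python
import math

def spatial_aware_attention(q, k, v, coords, b_1d, b_x, b_y):
    """q, k, v: lists of already-projected vectors, one per token.
    coords: one (x, y) anchor per token.
    b_1d, b_x, b_y: dicts mapping a relative offset to a scalar bias."""
    n, d_head = len(q), len(q[0])
    outputs = []
    for i in range(n):
        scores = []
        for j in range(n):
            # Scaled dot-product score alpha_ij.
            alpha = sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_head)
            # Add the 1D and 2D relative position biases to obtain alpha'_ij.
            alpha += b_1d.get(j - i, 0.0)
            alpha += b_x.get(coords[j][0] - coords[i][0], 0.0)
            alpha += b_y.get(coords[j][1] - coords[i][1], 0.0)
            scores.append(alpha)
        # Numerically stable softmax over j.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors gives h_i.
        outputs.append([sum(weights[j] * v[j][c] for j in range(n))
                        for c in range(len(v[0]))])
    return outputs

q = k = [[0.0, 0.0], [0.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
coords = [(0, 0), (5, 0)]
uniform = spatial_aware_attention(q, k, v, coords, {}, {}, {})
biased = spatial_aware_attention(q, k, v, coords, {1: 50.0}, {}, {})
```

With zero queries and no bias, every token attends uniformly; a large bias on offset `+1` pushes token 0 to attend almost entirely to token 1, which shows how the bias alone can steer attention.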
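or [Not Covered]
Some document elements (signs, bars) look very much like covered regions, so searching the image for word-level covered regions is noisy; covering whole lines avoids this noise.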
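
A minimal sketch of line-level covering and the resulting token labels is shown below. It assumes each token carries the id of the text line it belongs to; the function name `cover_lines` and the default cover probability of 0.15 are illustrative assumptions.

```python
import random

def cover_lines(token_line_ids, cover_prob=0.15, seed=None):
    """Pick whole lines to cover and label every token accordingly.

    token_line_ids: for each token, the id of the text line it belongs to.
    Returns (set of covered line ids, per-token labels).
    """
    rng = random.Random(seed)
    covered = set()
    for line_id in sorted(set(token_line_ids)):
        if rng.random() < cover_prob:
            covered.add(line_id)
    labels = ["[Covered]" if line_id in covered else "[Not Covered]"
              for line_id in token_line_ids]
    return covered, labels

line_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2]
covered, labels = cover_lines(line_ids, seed=0)
```

Since the decision is made once per line, all tokens on the same line always share the same label.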
Semantic entity recognition task and relation extraction task.
Figures: Semantic Entity Recognition | Relation Extraction
Language-specific fine-tuning
: fine-tune on language X and test on language X;

Zero-shot transfer learning
: fine-tune on English and test on other languages;

Multitask fine-tuning
: train the model on all languages.

The different granularities of image (e.g., dense image pixels or contiguous region features) and text (i.e., discrete tokens) objectives further add difficulty to cross-modal alignment learning.
LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked.
Document images are represented by linear projection features of image patches, in the following steps:
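
As a rough illustration, a generic ViT-style patch embedding can be sketched as follows. The single-channel input, the patch size, the toy weight matrix, and the helper names `patchify` and `linear_project` are assumptions for this sketch, not details taken from LayoutLMv3.

```python
def patchify(image, patch):
    """Split an H x W image (nested lists, single channel) into
    non-overlapping patch x patch tiles, flattened in row-major order."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def linear_project(patches, weight):
    """Project each flattened patch with a (patch*patch) x D weight matrix."""
    dim = len(weight[0])
    return [[sum(p[k] * weight[k][d] for k in range(len(p))) for d in range(dim)]
            for p in patches]

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
patches = patchify(image, patch=2)                         # 4 patches of 4 pixels
weight = [[1.0] for _ in range(4)]                         # toy 4 x 1 projection
embedded = linear_project(patches, weight)
```

The point of this design is that no CNN backbone or region detector is needed: each patch becomes one image token after a single matrix multiplication.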
We insert semantic 1D relative position and spatial 2D relative position as bias terms in the self-attention networks for the text and image modalities, following LayoutLMv2.
LayoutLMv3 learns to reconstruct masked word tokens of the text modality and symmetrically reconstruct masked patch tokens of the image modality.
Inspired by BERT, 30% of the text tokens are masked with a span masking strategy, with span lengths drawn from a Poisson distribution ($\lambda=3$).
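
The span masking step might look like the sketch below. It is an assumption-laden illustration: span starts are sampled uniformly, zero-length draws are bumped to length 1, spans are clipped to the remaining budget, and the Poisson sampler uses Knuth's multiplication method because the Python standard library has no built-in one. The names `sample_poisson` and `span_mask` are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1

def span_mask(num_tokens, mask_ratio=0.3, lam=3.0, seed=None):
    """Return sorted positions to mask, chosen as contiguous spans."""
    rng = random.Random(seed)
    budget = int(round(num_tokens * mask_ratio))
    masked = set()
    while len(masked) < budget:
        length = max(1, sample_poisson(lam, rng))     # avoid empty spans
        length = min(length, budget - len(masked))    # never exceed the budget
        start = rng.randrange(0, num_tokens - length + 1)
        masked.update(range(start, start + length))   # overlaps add fewer new positions
    return sorted(masked)

positions = span_mask(100, seed=0)
```

Masking contiguous spans instead of isolated tokens makes the reconstruction harder to solve from immediate neighbors alone.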
Symmetric to the MLM objective, about 40% of the image tokens are masked at random with a blockwise masking strategy.
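
Blockwise masking can be sketched in the spirit of BEiT as follows. The minimum block size, the aspect-ratio range, and the function name `blockwise_mask` are assumptions for illustration, and this simplified version may slightly overshoot the target ratio because block sides are rounded.

```python
import math
import random

def blockwise_mask(grid_h, grid_w, mask_ratio=0.4, min_block=4, seed=None):
    """Mask rectangular blocks of patches until at least
    mask_ratio of the grid_h x grid_w grid is covered."""
    rng = random.Random(seed)
    target = int(grid_h * grid_w * mask_ratio)
    masked = set()
    while len(masked) < target:
        remaining = target - len(masked)
        area = rng.randint(min(min_block, remaining), remaining)
        aspect = rng.uniform(0.3, 1 / 0.3)  # height/width ratio of the block
        block_h = max(1, min(grid_h, int(round(math.sqrt(area * aspect)))))
        block_w = max(1, min(grid_w, int(round(math.sqrt(area / aspect)))))
        top = rng.randint(0, grid_h - block_h)
        left = rng.randint(0, grid_w - block_w)
        for r in range(top, top + block_h):
            for c in range(left, left + block_w):
                masked.add((r, c))
    return masked

mask = blockwise_mask(14, 14, seed=0)
```

Masking whole blocks rather than scattered patches removes local context, so a masked patch cannot simply be copied from its direct neighbors.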
Because the image tokenizer transforms dense image pixels into discrete tokens according to a visual vocabulary, the MIM objective facilitates learning high-level layout structures rather than noisy low-level details.
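Background: Image Tokenizer
Trained as a discrete VAE with a fixed visual vocabulary, it converts an image into a fixed-length sequence of discrete tokens.
Image tokens produced by a discrete VAE (DALL-E, BEiT)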
The WPA objective is to predict whether the corresponding image patch of a text word is masked.
Specifically, for an unmasked text token, the label is 1 if its corresponding image patch is masked and 0 otherwise; masked text tokens are not considered.
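
The labelling rule can be written out directly. In this sketch, the one-patch-per-token mapping `token_to_patch`, the function name `wpa_labels`, and the use of `-100` as the "ignore" value (borrowed from a common loss-function convention) are assumptions.

```python
def wpa_labels(text_masked, token_to_patch, patch_masked, ignore_index=-100):
    """Build word-patch alignment labels.

    text_masked:    per text token, True if the token itself is masked.
    token_to_patch: per text token, index of its corresponding image patch.
    patch_masked:   per image patch, True if the patch is masked.
    Masked text tokens get ignore_index so they are excluded from the loss.
    """
    labels = []
    for is_masked, patch in zip(text_masked, token_to_patch):
        if is_masked:
            labels.append(ignore_index)                    # not considered
        else:
            labels.append(1 if patch_masked[patch] else 0)
    return labels

labels = wpa_labels(
    text_masked=[False, True, False],
    token_to_patch=[0, 1, 2],
    patch_masked=[True, True, False],
)
```

Here the first token is unmasked but its patch is masked (label 1), the second token is masked and therefore ignored, and the third token and its patch are both unmasked (label 0).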