PVT (Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions)
1. Network architecture
1. An input image of size $H\times W\times 3$ passes through a Patch Embedding layer, which splits it into $\frac{HW}{4^2}$ patches, each of size $4\times 4\times 3$. A Linear Projection then maps these to embedded patches of shape $\frac{HW}{4^2}\times C_1$.
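
The shape bookkeeping in this step can be checked with a small NumPy sketch. It is only an illustration under assumptions, not the PVT reference implementation: the function name `patch_embed`, the value `embed_dim=64` standing in for $C_1$, and the randomly initialized projection weights are all hypothetical choices made for the example.

```python
import numpy as np

def patch_embed(image, patch_size=4, embed_dim=64, seed=0):
    """Split an H x W x C image into non-overlapping patches and project each linearly.

    embed_dim plays the role of C_1; its value here is an assumption.
    """
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, "H and W must be divisible by patch_size"
    gh, gw = H // patch_size, W // patch_size  # patch grid: (H/4) x (W/4)

    # (gh, p, gw, p, C) -> (gh, gw, p, p, C) -> (gh*gw, p*p*C)
    patches = image.reshape(gh, patch_size, gw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch_size * patch_size * C)

    # Linear projection with random (untrained) weights, for shape checking only.
    rng = np.random.default_rng(seed)
    weight = rng.standard_normal((patch_size * patch_size * C, embed_dim)) * 0.02
    bias = np.zeros(embed_dim)
    tokens = patches @ weight + bias  # (HW / 4^2, C_1)
    return patches, tokens

# Example: H = 16, W = 32, so HW / 4^2 = 32 patches of 4 * 4 * 3 = 48 values each.
image = np.arange(16 * 32 * 3, dtype=np.float64).reshape(16, 32, 3)
patches, tokens = patch_embed(image)
print(patches.shape)  # (32, 48)
print(tokens.shape)   # (32, 64)
```

Each row of `patches` is one flattened $4\times 4\times 3$ patch, and each row of `tokens` is its $C_1$-dimensional embedding. In deep-learning frameworks the same operation is often written as a single strided convolution with kernel size and stride equal to the patch size, which is mathematically equivalent to the flatten-then-project form shown here.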