Multimodal Contrastive Training for Visual Representation Learning
We parameterize the image encoder as $f_{iq}$, with query feature $q_{ii}$ and key feature $k_{ii}$. We parameterize the textual encoder as $f_{cq}(\cdot;\Theta_q,\Phi_{cq})$ and the momentum textual encoder as $f_{ck}(\cdot;\Theta_k,\Phi_{ik})$.
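The text names a momentum textual encoder but this excerpt does not say how its parameters are updated. A minimal sketch follows, assuming the MoCo-style exponential moving average that momentum encoders commonly use: each key-side parameter is moved toward its query-side counterpart with coefficient `m`. The function name `momentum_update`, the dictionary keys `"theta"` and `"phi"`, and the default `m = 0.999` are hypothetical choices for illustration and are not taken from the source.

```python
def momentum_update(query_params, key_params, m=0.999):
    """Return new key-encoder parameters after one EMA step.

    Assumed rule (MoCo-style, not stated in the source):
        key <- m * key + (1 - m) * query
    Both arguments map a parameter-group name to a flat list of floats.
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("momentum coefficient m must lie in [0, 1]")
    if query_params.keys() != key_params.keys():
        raise ValueError("query and key encoders must share parameter groups")

    updated = {}
    for name, key_values in key_params.items():
        query_values = query_params[name]
        if len(query_values) != len(key_values):
            raise ValueError(f"size mismatch in parameter group {name!r}")
        # Blend each key parameter with the matching query parameter.
        updated[name] = [
            m * k + (1.0 - m) * q for k, q in zip(key_values, query_values)
        ]
    return updated


# Hypothetical toy parameters standing in for the query encoder
# (Theta_q, Phi_cq) and the momentum encoder (Theta_k, Phi_ik).
query = {"theta": [1.0, 2.0], "phi": [0.0]}
key = {"theta": [0.0, 0.0], "phi": [1.0]}
new_key = momentum_update(query, key, m=0.5)
```

Under this assumption only the query encoder would receive gradients, and the momentum encoder would follow it slowly, which keeps the key features consistent across training steps when `m` is close to 1.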