UNK replacement, detail change: assuming shuffle ratio = 1.0
(1) Encoded input_id:
tensor([[ 101, 1996, 2006, 1996, 7195, 1997, 5409, 1011, 1011, 102],
[ 101, 7842, 14194, 1997, 2100, 102, 0, 0, 0, 0],
(2) Corresponding unk_mask:
tensor([[0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
(3) After the incorrect replacement:
tensor([[ 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 9], unk
pos ord: [ 0, 1024, 1024, 1024, 1024, 5, 6, 7, 8, 9],
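The two rows above are consistent with a replacement that writes 1024 wherever unk_mask is 1 and keeps the original index everywhere else, so the padded slots of the second row keep 5..9. The following is only a sketch of that reading, not the original code: the function name and the `fill_value` parameter are hypothetical, and plain Python lists stand in for tensors so it runs without dependencies.

```python
FILL_VALUE = 1024  # value seen in the printouts above; its exact meaning is assumed


def replace_masked_keep_rest(unk_mask, fill_value=FILL_VALUE):
    """Set masked slots to fill_value; every other slot keeps its own index."""
    return [fill_value if m == 1 else i for i, m in enumerate(unk_mask)]


row0 = replace_masked_keep_rest([0, 1, 1, 1, 1, 1, 1, 1, 1, 0])
row1 = replace_masked_keep_rest([0, 1, 1, 1, 1, 0, 0, 0, 0, 0])
print(row0)
print(row1)
```

Under this reading, the indices 5..9 left in the second row are what the notes call the incorrect replacement.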
Rewritten with a different formulation:
pos ord result: [ 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0]
Step by step:
Step 1: (1-[0, 1, 1, 1, 1, 0, 0, 0, 0, 0]) * [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] = [0, 0, 0, 0, 0, 5, 6, 7, 8, 9]
Step 2: [0, 1, 1, 1, 1, 0, 0, 0, 0, 0] * 1024 = [0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0]
Step 3: (step 1 result) + (step 2 result) = [0, 1024, 1024, 1024, 1024, 5, 6, 7, 8, 9]
Step 4: multiply by unk_mask: [0, 1024, 1024, 1024, 1024, 5, 6, 7, 8, 9] * [0, 1, 1, 1, 1, 0, 0, 0, 0, 0] = [ 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0]
Result: [ 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0]
The positions of CLS (101) and SEP (102) also become 0. Since these two tokens carry no real meaning of their own, that is acceptable.
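The four steps can be replayed on the second row as a quick check. This is a minimal sketch under assumptions: the function name and `fill_value` are hypothetical, and plain Python lists replace tensors so it runs without dependencies. The same elementwise arithmetic would apply to tensors of matching shape.

```python
FILL_VALUE = 1024  # value used in the steps above


def pos_ord_steps(unk_mask, fill_value=FILL_VALUE):
    """Return the intermediate result of each of the four steps."""
    positions = list(range(len(unk_mask)))
    # Step 1: keep the index only where the mask is 0.
    step1 = [(1 - m) * p for m, p in zip(unk_mask, positions)]
    # Step 2: put fill_value where the mask is 1.
    step2 = [m * fill_value for m in unk_mask]
    # Step 3: add the two partial results.
    step3 = [a + b for a, b in zip(step1, step2)]
    # Step 4: multiply by the mask again, zeroing every unmasked slot.
    step4 = [s * m for s, m in zip(step3, unk_mask)]
    return step1, step2, step3, step4


mask = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
step1, step2, step3, step4 = pos_ord_steps(mask)
for name, values in zip(("step1", "step2", "step3", "step4"),
                        (step1, step2, step3, step4)):
    print(name, values)
```

For a strictly 0/1 mask, step 4 gives the same values as step 2, because the final multiplication removes exactly the indices that step 1 contributed.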