XF-Event-Extraction Study Notes

iFLYTEK Event Extraction Challenge

http://challenge.xfyun.cn/topic/info?type=hotspot
https://github.com/WuHuRestaurant/xf_event_extraction2020Top1

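My impression is that the strength of this pipeline solution lies in its workflow and data processing, so the parts worth borrowing are the overall workflow and the various tricks.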


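The overall flow is roughly as follows:

[Figure: pipeline flowchart]

In other words, the trigger extractor (TriggerExtractor) is trained first, then the argument extractors (Role1Extractor and Role2Extractor), and finally the attribute classifier (AttributionClassifier); this is also the training order.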

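1. Trigger extraction (`TriggerExtractor`)

Input data processing

Triggers annotated in the dataset:
1. Initialize the label of every character position in the input text to [0, 0], e.g. [[0, 0], ..., [0, 0], ..., [0, 0]]
2. For the annotated trigger text, locate it in the original text and set the labels at its start and end positions, e.g. [[0, 0], ..., [0, 0], [1, 0], [0, 1], [0, 0], ..., [0, 0]]
Triggers annotated by distant supervision:
1. Build distant_trigger_label with one entry per character position, e.g. [0, 0, ..., 0]
2. Iterate over the distantly supervised trigger texts, find their start and end indices, and set the values of distant_trigger_label at those indices to 1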
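The two labelling schemes above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the helper names (find_all, build_trigger_labels, build_distant_trigger_label) and the example sentence are made up for this note, and the sketch ignores the offset that special tokens such as [CLS] would introduce.

```python
def find_all(text, sub):
    """Return (start, end) character spans (end inclusive) of every occurrence of sub."""
    spans = []
    if not sub:
        return spans
    start = text.find(sub)
    while start != -1:
        spans.append((start, start + len(sub) - 1))
        start = text.find(sub, start + 1)
    return spans


def build_trigger_labels(text, trigger, trigger_start):
    """Gold labels: one [start_flag, end_flag] pair per character."""
    labels = [[0, 0] for _ in text]
    labels[trigger_start][0] = 1                      # start position
    labels[trigger_start + len(trigger) - 1][1] = 1   # end position
    return labels


def build_distant_trigger_label(text, distant_triggers):
    """Distant feature: 1 at the start and end index of every matched trigger word."""
    label = [0] * len(text)
    for word in distant_triggers:
        for start, end in find_all(text, word):
            label[start] = 1
            label[end] = 1
    return label


# Hypothetical example sentence; "收购" is the gold trigger.
text = "公司昨日宣布收购一家初创企业"
gold = build_trigger_labels(text, "收购", text.index("收购"))
distant = build_distant_trigger_label(text, ["收购", "宣布"])
```

Here gold marks position 6 as a start and position 7 as an end, while distant is a flat 0/1 list that also flags "宣布", a word that merely appears in the distant trigger vocabulary.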

At this point the features are fully built:

[Figure: the constructed features]

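Model structure

1. The input text is passed through BERT to obtain the sentence representation
2. If the distant supervision feature is enabled, it is embedded via self.distant_trigger_embedding(distant_trigger) (the feature itself is just 0 or 1) and concatenated with the BERT output via torch.cat([seq_out, distant_trigger_feature], dim=-1)
3. One DNN layer follows
4. The output is a two-way binary classification at every character position; sigmoid is used here, and the intended outputs are [0, 0], [0, 1] (end_idx) or [1, 0] (start_idx)

So overall, trigger extraction is still solved as a sequence labeling task. The distant supervision information is merged directly into each character position and thereby influences the final output.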
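The notes do not show how the per-position sigmoid outputs are turned back into trigger spans. A minimal sketch of one common decoding rule follows; the threshold of 0.5, the max_len limit and the "nearest end after each start" pairing are assumptions for illustration, not the repository's confirmed behaviour.

```python
def decode_spans(probs, threshold=0.5, max_len=10):
    """probs: list of [p_start, p_end] per character.
    Returns (start, end) spans with end inclusive."""
    spans = []
    for i, (p_start, _) in enumerate(probs):
        if p_start < threshold:
            continue
        # Pair each start with the nearest end at or after it, within max_len characters.
        for j in range(i, min(i + max_len, len(probs))):
            if probs[j][1] >= threshold:
                spans.append((i, j))
                break
    return spans


# Hypothetical model output for a 5-character sentence.
probs = [[0.01, 0.02], [0.03, 0.01], [0.92, 0.05], [0.04, 0.88], [0.02, 0.03]]
spans = decode_spans(probs)
```

Because the start and end channels are independent sigmoids, a single-character trigger can have both flags set at the same position, which the inner loop handles by starting at j = i.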


# Distant trigger feature: use an embedding layer so that its representation can be optimized during training
if use_distant_trigger:
    embedding_dim = kwargs.pop('embedding_dims', 256)
    self.distant_trigger_embedding = nn.Embedding(num_embeddings=2, embedding_dim=embedding_dim)
    out_dims += embedding_dim

mid_linear_dims = kwargs.pop('mid_linear_dims', 128)

# A single feed-forward (DNN) block
self.mid_linear = nn.Sequential(
    nn.Linear(out_dims, mid_linear_dims),
    nn.ReLU(),
    nn.Dropout(dropout_prob)
)

self.classifier = nn.Linear(mid_linear_dims, 2)

self.activation = nn.Sigmoid()

self.criterion = nn.BCELoss()

init_blocks = [self.mid_linear, self.classifier]

if use_distant_trigger:
    init_blocks += [self.distant_trigger_embedding]

self._init_weights(init_blocks, initializer_range=self.bert_config.initializer_range)

The forward pass is defined as follows:

# Get the sentence representation output by BERT
bert_outputs = self.bert_module(
    input_ids=token_ids,
    attention_mask=attention_masks,
    token_type_ids=token_type_ids
)
seq_out = bert_outputs[0]

# Embed the distant trigger feature
if self.use_distant_trigger:
    assert distant_trigger is not None, \
        'When using distant trigger features, distant trigger should be implemented'

    distant_trigger_feature = self.distant_trigger_embedding(distant_trigger)
    # distant_trigger_feature shape: torch.Size([4, 256, 256])
    seq_out = torch.cat([seq_out, distant_trigger_feature], dim=-1)
    # seq_out.shape torch.Size([4, 256, 1024])

# Pass the representation (possibly concatenated with the feature) through the DNN block
seq_out = self.mid_linear(seq_out)
# seq_out : torch.Size([4, 256, 128])

# Sigmoid is used here, so despite its name `logits` holds probabilities in (0, 1)
# logits shape: torch.Size([4, 256, 2])
logits = self.activation(self.classifier(seq_out))

out = (logits,)

if labels is not None:
    loss = self.criterion(logits, labels.float())
    out = (loss,) + out

return out
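To make the loss concrete: nn.BCELoss with its default 'mean' reduction averages the binary cross-entropy over every element, so both the start and the end channel of every position contribute equally. The plain-Python sketch below mirrors that computation; the eps clamp is a simplification chosen for this note (PyTorch guards against log(0) differently), and the numbers are invented.

```python
import math


def bce_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over every element (default 'mean' reduction)."""
    total, count = 0.0, 0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            count += 1
    return total / count


# One sentence of 4 characters: start at position 1, end at position 2.
labels = [[0, 0], [1, 0], [0, 1], [0, 0]]
good = [[0.1, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.1]]  # confident and correct
bad = [[0.9, 0.9], [0.1, 0.9], [0.9, 0.1], [0.9, 0.9]]   # confident and wrong

good_loss = bce_loss(good, labels)
bad_loss = bce_loss(bad, labels)
```

good_loss works out to -log(0.9) and bad_loss to -log(0.1), which shows how strongly confident mistakes are penalized.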

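2. Argument extraction (`Role1Extractor` and `Role2Extractor`)

Following the project, subject and object are handled together as one group, and time and loc are handled together as another.

Input

[Figure: model input]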

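Model structure

1. Feed the text into BERT to obtain the sentence representation
2. Use the trigger indices to pick out the trigger's BERT representation (start-idx and end-idx, hence 768*2=1536)
3. Combine the sentence representation with the trigger representation via conditional_layer_norm to obtain the conditioned representation
4. If the trigger distance feature is enabled, obtain it via self.trigger_distance_embedding(trigger_distance), where the layer is nn.Embedding(num_embeddings=512, embedding_dim=256). I believe num_embeddings is 512 because BERT supports at most 512 tokens, so, like a position embedding, it covers the maximum length directly (to test: what if this value is smaller than the maximum sentence length? Result: it raises an error)
5. Concatenate the representation from step 3 with the feature from step 4
6. For subject and object, apply sigmoid and build logits = torch.cat([obj_logits, sub_logits], dim=-1); the logic is similar to the trigger case above:
   - subject
     - [0 0 1 0] -> start index
     - [0 0 0 1] -> end index
   - object
     - [1 0 0 0] -> start index
     - [0 1 0 0] -> end index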
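The four-channel layout above can be illustrated with a small label builder. The function name and the example spans are hypothetical; the channel order [obj_start, obj_end, sub_start, sub_end] follows the torch.cat([obj_logits, sub_logits], dim=-1) order quoted above.

```python
def build_role1_labels(seq_len, obj_span=None, sub_span=None):
    """One [obj_start, obj_end, sub_start, sub_end] row per position.
    Spans are (start, end) with end inclusive."""
    labels = [[0, 0, 0, 0] for _ in range(seq_len)]
    if obj_span is not None:
        labels[obj_span[0]][0] = 1
        labels[obj_span[1]][1] = 1
    if sub_span is not None:
        labels[sub_span[0]][2] = 1
        labels[sub_span[1]][3] = 1
    return labels


# Hypothetical 8-position sentence: subject at 0..1, object at 5..6.
labels = build_role1_labels(8, obj_span=(5, 6), sub_span=(0, 1))

# Split the channels the same way the loss does: first two for object, last two for subject.
obj_part = [row[:2] for row in labels]
sub_part = [row[2:] for row in labels]
```

Keeping object and subject in separate channels lets the two spans overlap without conflicting, since each channel is scored by its own sigmoid.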

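The model structure is shown in the figure below; I think one of the arrows in it is drawn incorrectly.

[Figure: role extractor model structure]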
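The conditional_layer_norm step mentioned above is not shown in these notes. The sketch below illustrates the general idea of conditional layer normalization for a single token vector in plain Python: normalize as usual, then let the condition vector (here, the trigger representation) adjust the scale and shift. All names and parameters are assumptions for illustration, and the repository's actual implementation may differ.

```python
import math


def conditional_layer_norm(x, cond, w_gamma, w_beta, gamma, beta, eps=1e-12):
    """x: one token vector (length H); cond: condition vector (length C).
    w_gamma, w_beta: H x C matrices projecting the condition to per-dimension offsets.
    gamma, beta: the usual LayerNorm scale and shift (length H)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    out = []
    for h, n in enumerate(normed):
        scale = gamma[h] + sum(w * c for w, c in zip(w_gamma[h], cond))
        shift = beta[h] + sum(w * c for w, c in zip(w_beta[h], cond))
        out.append(scale * n + shift)
    return out


x = [1.0, 2.0, 3.0, 4.0]
cond = [0.5, -0.5]
zeros = [[0.0, 0.0] for _ in x]

# With zero projections the condition has no effect: this is plain LayerNorm.
plain = conditional_layer_norm(x, cond, zeros, zeros, [1.0] * 4, [0.0] * 4)

# A non-zero shift projection moves every dimension by 1.0 * cond[0] = 0.5.
shifted = conditional_layer_norm(x, cond, zeros, [[1.0, 0.0]] * 4, [1.0] * 4, [0.0] * 4)
```

In the model, the same condition (the 1536-dimensional trigger vector) would be applied to every position of the sentence, so each token representation becomes trigger-aware.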

# seq_out.shape:  torch.Size([4, 256, 768])
# pooled_out.shape:  torch.Size([4, 768])
seq_out, pooled_out = bert_outputs[0], bert_outputs[1]

# trigger_label_feature.shape:  torch.Size([4, 2, 768])
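# Take the vectors at the trigger's start and end positions as the feature
# One caveat: in the competition data each sentence has exactly one trigger; with multiple triggers this would need to change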
trigger_label_feature = self._batch_gather(seq_out, trigger_index)

# trigger_label_feature.shape:  torch.Size([4, 1536])
trigger_label_feature = trigger_label_feature.view([trigger_label_feature.size()[0], -1])

# seq_out.shape:  torch.Size([4, 256, 768])
seq_out = self.conditional_layer_norm(seq_out, trigger_label_feature)

if self.use_trigger_distance:
    assert trigger_distance is not None, \
        'When using trigger distance features, trigger distance should be implemented'
    # trigger_distance_feature.shape:  torch.Size([4, 256, 256])
    trigger_distance_feature = self.trigger_distance_embedding(trigger_distance)

    # seq_out.shape:  torch.Size([4, 256, 1024])
    seq_out = torch.cat([seq_out, trigger_distance_feature], dim=-1)

    # seq_out.shape:  torch.Size([4, 256, 1024])
    seq_out = self.layer_norm(seq_out)

    # seq_out = self.dropout_layer(seq_out)
# seq_out.shape:  torch.Size([4, 256, 128])
seq_out = self.mid_linear(seq_out)

obj_logits = self.activation(self.obj_classifier(seq_out))
sub_logits = self.activation(self.sub_classifier(seq_out))

# logits.shape:  torch.Size([4, 256, 4])
logits = torch.cat([obj_logits, sub_logits], dim=-1)
out = (logits,)

if labels is not None:
    masks = torch.unsqueeze(attention_masks, -1)

    labels = labels.float()
    obj_loss = self.criterion(obj_logits * masks, labels[:, :, :2])
    sub_loss = self.criterion(sub_logits * masks, labels[:, :, 2:])

    loss = obj_loss + sub_loss

    out = (loss,) + out

return out
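On the trigger distance feature used above: the notes do not show how trigger_distance is computed, so the sketch below uses one plausible definition (distance to the nearest edge of the trigger span, 0 inside it) purely as an assumption. It also clips the value to num_embeddings - 1, because nn.Embedding can only look up indices in [0, num_embeddings), which is consistent with the error observed when num_embeddings is too small.

```python
def build_trigger_distance(seq_len, trigger_start, trigger_end, num_embeddings=512):
    """Distance of every position to the trigger span (0 inside the span),
    clipped so that it is always a valid embedding index."""
    distances = []
    for i in range(seq_len):
        if i < trigger_start:
            d = trigger_start - i
        elif i > trigger_end:
            d = i - trigger_end
        else:
            d = 0
        distances.append(min(d, num_embeddings - 1))
    return distances


# Hypothetical 10-position sentence with the trigger at positions 4..5.
dist = build_trigger_distance(seq_len=10, trigger_start=4, trigger_end=5)

# With a deliberately tiny table, far-away positions share the last index.
clipped = build_trigger_distance(seq_len=6, trigger_start=0, trigger_end=0, num_embeddings=3)
```

Clipping is one way to keep a smaller embedding table usable; the repository instead sizes the table at 512 so that every possible distance has its own row.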

For Role2Extractor, the author also provides a CRF-based option:

[Figure: CRF variant of Role2Extractor]

3. Attribute classification (`AttributionClassifier`)

This code seems to contain a bug; still to be investigated.

[Figure: image.png]

trigger_index:  tensor([[119, 120]], device='cuda:6')
labels:  tensor([[0, 0]], device='cuda:6')

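Here the labels are built as labels = [tense2id[raw_label[0]], polarity2id[raw_label[1]]], where

tense2id: "map": { "过去": 0, "将来": 1, "其他": 2, "现在": 3 } (past, future, other, present)
polarity2id: "map": { "肯定": 0, "可能": 1, "否定": 2 } (positive, possible, negative)

pooling_masks is built as follows: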
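As a small worked example of this label encoding, the sketch below maps a raw (tense, polarity) pair to ids and back. The two dictionaries are copied from the maps above; the helper names and the reverse maps are additions for illustration.

```python
tense2id = {"过去": 0, "将来": 1, "其他": 2, "现在": 3}
polarity2id = {"肯定": 0, "可能": 1, "否定": 2}

# Reverse maps for turning predictions back into label strings.
id2tense = {v: k for k, v in tense2id.items()}
id2polarity = {v: k for k, v in polarity2id.items()}


def encode_attribution_label(raw_label):
    """raw_label is assumed to be (tense, polarity); returns [tense_id, polarity_id]."""
    return [tense2id[raw_label[0]], polarity2id[raw_label[1]]]


def decode_attribution_label(ids):
    """Inverse of encode_attribution_label."""
    return (id2tense[ids[0]], id2polarity[ids[1]])


labels = encode_attribution_label(("过去", "肯定"))
```

The result [0, 0] corresponds to the labels tensor printed above, i.e. a past-tense, positive-polarity event.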

# Take a window of 20 on each side of the trigger as its context
pooling_masks_range = range(max(1, trigger_loc[0] - window_size),
                            min(min(1 + len(raw_text), max_seq_len - 1), trigger_loc[1] + window_size))

pooling_masks = [0] * max_seq_len
for i in pooling_masks_range:
    pooling_masks[i] = 1
for i in range(trigger_loc[0], trigger_loc[1] + 1):
    pooling_masks[i] = 0

Assuming trigger_loc is [119, 120], the resulting pooling_masks is: tensor([[0., 0., 0., ..., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0...

That is, ones over the 20 positions to the left of the trigger, zeros at the two trigger positions, and ones to its right (only 9 here, presumably because the text ends there).
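The mask construction above can be wrapped in a function and checked on the same trigger_loc. In this sketch, max_seq_len=256 and text_len=129 are assumptions: the first is borrowed from the tensor shapes elsewhere in these notes, and the second is chosen only because it reproduces the printed pattern of 20 ones, 2 zeros and 9 ones.

```python
def build_pooling_masks(trigger_loc, text_len, max_seq_len=256, window_size=20):
    """1 for context positions around the trigger, 0 for the trigger itself and everything else."""
    start = max(1, trigger_loc[0] - window_size)
    end = min(min(1 + text_len, max_seq_len - 1), trigger_loc[1] + window_size)

    masks = [0] * max_seq_len
    for i in range(start, end):
        masks[i] = 1
    # The trigger positions themselves are excluded from the pooled context.
    for i in range(trigger_loc[0], trigger_loc[1] + 1):
        masks[i] = 0
    return masks


masks = build_pooling_masks([119, 120], text_len=129)
ones = [i for i, m in enumerate(masks) if m == 1]

# With a long text the right-hand window is not cut off by the text length.
long_masks = build_pooling_masks([119, 120], text_len=250)
right_side = sum(long_masks[121:])
```

One detail worth noticing: because range excludes its end, a long text gets 20 context positions on the left but only 19 on the right (right_side is 19), so the window is not quite symmetric.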

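The model therefore works as follows:

1. Obtain the sentence representation from BERT
2. Use self._batch_gather(seq_out, trigger_index) to get the trigger's embedding vectors as trigger_label_feature
3. (Continuing from 1) apply view and transpose to obtain the sentence representation with shape (bs, hidden, seq_len)
4. (Continuing from 3) pool that tensor with nn.AdaptiveMaxPool1d to obtain pooled_out
5. Concatenate via torch.cat([pooled_out, trigger_label_feature], dim=-1) to obtain logits, i.e. the shared feature fed to both classifiers
6. Obtain polarity_logits and tense_logits separately and apply softmax to each
7. Compute the two losses separately and add them together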
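The last two steps (two softmax heads whose losses are summed) can be written out in plain Python. This is a sketch under assumptions: the logit values are invented, and softmax followed by negative log-likelihood is used as the loss, which is what a standard cross-entropy loss computes; the notes do not state which loss class the repository uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


# Hypothetical outputs of the two heads for one example.
tense_logits = [2.0, 0.1, -1.0, 0.3]   # 4 tense classes
polarity_logits = [1.5, 0.2, -0.7]     # 3 polarity classes
labels = [0, 0]                        # [tense_id, polarity_id]

tense_loss = cross_entropy(tense_logits, labels[0])
polarity_loss = cross_entropy(polarity_logits, labels[1])
loss = tense_loss + polarity_loss
```

Summing the two losses weights tense and polarity equally; a weighted sum would be the obvious variation if one of the two attributes turned out to be harder to learn.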
