22 - Data Augmentation: Style Transfer as Data Augmentation: A Case Study on Named Entity Recognition

Data Augmentation Category

This paper is somewhat similar to "Open Relation and Event Type Discovery with Type Abstraction".

Table of Contents

  • Data Augmentation Category
  • background
  • 1. Model
      • model structure
        • paraphrase generation
        • cycle-consistent reconstruction
  • 2. Data Selection
  • Experiments
  • Summary


background

In few-shot scenarios, the goal is to increase the size and diversity of the training data.

Recent studies: (1) rule-based methods; (2) the potential of leveraging data from high-resource tasks.

Core idea: change the style-related attributes of text while preserving its semantics.
The paper formulates this task as a paraphrase generation problem.

1. Model

Depending on whether the data is parallel or not, the paper proposes two ways to solve the problem.
For parallel data, a paraphrase generation model is used to bridge the gap between the source and target styles.
For non-parallel data, cycle-consistent reconstruction is used to paraphrase the paraphrased sentences back into their original style (a bit convoluted, admittedly).

model structure

paraphrase generation

Two loss functions:
L_pg: a loss over the NER labels, judging whether the model's predicted BIO tags are correct.
L_adv: an adversarial loss judging the similarity between the input and its paraphrase, computed with a discriminator.

[Figure 1]
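These two terms can be sketched in plain Python; the probability shapes, the helper names, and the λ weighting below are illustrative assumptions rather than the paper's implementation:

```python
import math

def cross_entropy(probs, gold):
    # Mean negative log-likelihood of the gold class indices.
    return -sum(math.log(p[g]) for p, g in zip(probs, gold)) / len(gold)

def paraphrase_generation_loss(tag_probs, gold_tags, disc_real_prob, lambda_adv=1.0):
    """Sketch: L = L_pg + lambda * L_adv (names and weighting are assumptions)."""
    # L_pg: token-level loss on the BIO tags predicted for the paraphrase.
    l_pg = cross_entropy(tag_probs, gold_tags)
    # L_adv: the generator improves when the discriminator scores the
    # paraphrase as indistinguishable from the real input (prob -> 1).
    l_adv = -math.log(disc_real_prob)
    return l_pg + lambda_adv * l_adv
```

A perfect tagger together with a fully fooled discriminator drives both terms to zero.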

cycle-consistent reconstruction

Process:

First: the generator G_θ generates the paraphrase ỹ_cycle of the input sentence x_cycle concatenated with a prefix.
Second: the paraphrase ỹ_cycle is concatenated with a different prefix as input to the generator G_θ, which transfers the paraphrase back to the original-style sentence ŷ_cycle.

The loss contains two parts.
[Figures 2 and 3: the two parts of the cycle-consistency loss]

[Figure 4]
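The two-step cycle can be sketched as follows; the prefix strings and the toy ROT13 generator are illustrative assumptions standing in for the real pretrained generator G_θ:

```python
import codecs

def cycle_reconstruct(generator, x, forward_prefix="informal: ", backward_prefix="formal: "):
    # Step 1: G generates the paraphrase y~_cycle of x, prompted by a style prefix.
    y_tilde = generator(forward_prefix + x)
    # Step 2: the opposite prefix is prepended and G transfers the
    # paraphrase back to the original style, giving y^_cycle.
    y_hat = generator(backward_prefix + y_tilde)
    # The cycle-consistency loss compares y^_cycle against the original x;
    # here we just return both outputs.
    return y_tilde, y_hat

def toy_generator(prompt):
    # Stand-in for G_theta: strip the prefix and apply ROT13, a transform
    # that is its own inverse, so the cycle reconstructs x exactly.
    text = prompt.split(": ", 1)[1]
    return codecs.encode(text, "rot13")
```

With the toy generator the round trip is exact, which is precisely what the reconstruction loss pushes the real generator toward.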

2. Data Selection

Even with an effective architecture, the generated sentences may still be unreliable: they can be of low quality due to degenerate repetition and incoherent gibberish (Holtzman et al., 2020; Welleck et al., 2020). To mitigate this problem, data selection is further performed with the following metrics.

  • Consistency: the confidence score from a pretrained style classifier, measuring how well the generated sentence matches the target style.
  • Adequacy: the confidence score from a pretrained NLU model, measuring how much of the original semantics the generated sentence preserves.
  • Fluency: the confidence score from a pretrained NLU model, indicating how fluent the generated sentence is.
  • Diversity: the character-level edit distance between the original and generated sentences.

For each sentence, k = 10 candidates are over-generated. The metrics above are computed (see Appendix C for details) and combined into a weighted score for each candidate; all candidates are then ranked by this score, and the best one is selected to train the NER system.
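The selection step amounts to a weighted ranking over the over-generated candidates. In the sketch below, the Levenshtein routine implements the character-level diversity metric, while the scorer callables stand in for the pretrained style classifier and NLU models (the function names and weights are assumptions):

```python
def edit_distance(a, b):
    # Character-level Levenshtein distance (the diversity metric),
    # computed with a single rolling row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def select_best(source, candidates, scorers, weights):
    # Rank the k over-generated candidates by the weighted sum of their
    # metric scores and keep the single best one for NER training.
    def weighted_score(cand):
        return sum(w * f(source, cand) for f, w in zip(scorers, weights))
    return max(candidates, key=weighted_score)
```

With only the diversity scorer active, the ranking simply favors the candidate farthest from the source; in practice all four metrics would be weighted together.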

Experiments

  • Performance of models under different data augmentation strategies.

The source data covers five domains in the formal style: broadcast conversation (BC), broadcast news (BN), magazine (MZ), newswire (NW), and web data (WB), while the target data covers only the social media (SM) domain in the informal style.

[Figure 5]

  • Impact of different factors on performance (ablation study).
    [Figure 6]

Summary

Doesn't this feel overly complicated? Compared with question-based data augmentation, this method also has to distinguish between parallel and non-parallel data and build a separate model for each case.

Tags: EMNLP, Artificial Intelligence