Abstract

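Chinese word segmentation (CWS) has many different segmentation criteria. This paper uses adversarial learning to extract the knowledge shared across those criteria.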

In this paper, we propose adversarial multi-criteria learning for CWS by integrating shared knowledge from multiple heterogeneous segmentation criteria.

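Earlier work also exploited multiple corpora, but mostly with linear classifiers over discrete features. This paper is essentially a multi-task setup: each segmentation criterion is treated as one task, and three different shared-private models are proposed, in which a shared layer extracts criterion-independent features and private layers extract criterion-specific features. Adversarial training is used to ensure that the shared layer extracts common underlying, criteria-invariant features.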

The contributions of this paper could be summarized as follows.

• Multi-criteria learning is first introduced for CWS, in which we propose three shared-private models to integrate multiple segmentation criteria.

• An adversarial strategy is used to force the shared layer to learn criteria-invariant features, in which a new objective function is also proposed instead of the original cross-entropy loss.

• We conduct extensive experiments on eight CWS corpora with different segmentation criteria, which is by far the largest number of datasets used simultaneously.
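
To make the adversarial idea in the second contribution concrete, here is a minimal sketch in plain Python. The notes above do not spell out the new objective, so this assumes one common formulation: a discriminator tries to guess which criterion a sentence came from (trained with cross-entropy), while the shared layer is trained to maximize the entropy of that guess, so that shared features reveal nothing about the criterion. The function names and the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discriminator_loss(logits, criterion_id):
    """Cross-entropy loss for the criterion discriminator (assumed setup).

    logits: discriminator scores, one per segmentation criterion.
    criterion_id: index of the corpus the sentence actually came from.
    """
    probs = softmax(logits)
    return -math.log(probs[criterion_id])

def shared_layer_entropy(logits):
    """Entropy of the discriminator's prediction.

    Under the assumption made here, the shared layer is updated to
    maximize this value: a uniform prediction means the shared features
    carry no information about the criterion.
    """
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits over eight criteria (eight corpora).
confident = [10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
uninformative = [0.0] * 8

# The uninformative case has the highest possible entropy, log(8).
print(shared_layer_entropy(confident) < shared_layer_entropy(uninformative))
```

In a real model both quantities would be computed on the output of the shared BLSTM and optimized with a deep learning framework; the sketch only shows how the two objectives pull in opposite directions.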

Methods

Each character is labeled with a tag from {B, M, E, S} (begin, middle, end, single). The general architecture is: character embedding layer -> feature layers (BLSTM) -> tag inference layer (CRF).
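
The {B, M, E, S} tagging scheme can be sketched as a pair of small conversion functions. This is only an illustration of the labeling, not the paper's code; the function names and the example sentence with its two segmentations are made up to show how two criteria yield different tag sequences for the same characters.

```python
def words_to_bmes(words):
    """Convert a list of segmented words into per-character BMES tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

def bmes_to_words(chars, tags):
    """Rebuild words from characters and their BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words

# Illustrative example: the same sentence under two hypothetical criteria.
coarse = ["中文", "分词", "很", "有意思"]
fine = ["中文", "分词", "很", "有", "意思"]

print(words_to_bmes(coarse))
print(words_to_bmes(fine))
```

Because every criterion produces tags over the same character sequence, each corpus can be treated as a separate sequence labeling task over shared input, which is what makes the multi-task formulation natural.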

[Figure: Adversarial Multi-Criteria Learning for Chinese Word Segmentation, image 2]

Three shared-private models for multi-criteria learning. The yellow blocks are the shared BLSTM while the gray blocks are private BLSTM. The yellow circles are shared embedding. The red information flow indicates the difference between three models.