Following are some excerpts from the paper Deep Learning for Chinese Word Segmentation and POS Tagging by Xiaoqing Zheng et al. These excerpts summarize the main idea of the paper. The details require typing many mathematical formulas, which is time-consuming, and are therefore omitted.
Paper name: Deep Learning for Chinese Word Segmentation and POS Tagging
Paper publication year: 2013
Paper authors: Xiaoqing Zheng et al.
Keywords: Deep Learning, Neural Networks, Chinese Word Segmentation, POS Tagging
Although the paper only applies deep learning to Chinese word segmentation (CWS) and POS tagging, the same method could also be applied to Chinese named entity recognition (NER).
Previous studies show that joint solutions usually improve accuracy over pipelined systems, both by exploiting POS information to help word segmentation and by avoiding error propagation. However, traditional joint approaches usually involve a great number of features.
The choice of features, therefore, is a critical success factor for these systems. Most of the state-of-the-art systems address their tasks by applying linear statistical models to the features carefully optimized for the tasks. This approach is effective because researchers can incorporate a large body of linguistic knowledge into the models. However, the approach does not scale well when it is used to perform more complex joint tasks, for example, the task of joint word segmentation, POS tagging, parsing, and semantic role labeling.
Traditional joint approaches suffer from several limitations:
The size of the resulting models is too large for practical use, given the storage and computing constraints of certain real-world applications.
The number of parameters is so large that the trained model is prone to overfitting the training corpus.
A longer training time is required.
Instead, we use multilayer neural networks to discover useful features from the input sentences. The paper makes two main contributions:
We describe a perceptron-style algorithm for training the neural networks, which not only speeds up training with negligible loss in performance, but is also easier to implement;
We show that the tasks of Chinese word segmentation and POS tagging can be effectively performed by deep learning. Our networks achieved close to state-of-the-art performance by transferring the unsupervised internal representations of Chinese characters into the supervised models.
In order to make learning algorithms less dependent on feature engineering, we chose to use a variant of the neural network architecture first proposed by (Bengio et al., 2003) for probabilistic language modeling, and reintroduced later by (Collobert et al., 2011) for multiple NLP tasks.
The network takes the input sentence and discovers multiple levels of feature extraction from the inputs, with higher levels representing more abstract aspects of the inputs. The first layer extracts features for each Chinese character. The next layer extracts features from a window of characters. The following layers are classical neural network layers. The output of the network is a graph over which tag inference is achieved with a Viterbi algorithm.
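To make the layering concrete, here is a minimal Python sketch of the forward pass for one character position. All sizes and names (emb, W1, W2, etc.) are hypothetical illustrations, not the paper's actual hyperparameters:

```python
import numpy as np

# Hypothetical sizes, not taken from the paper:
V, d, w, H, T = 5000, 50, 5, 300, 4   # vocabulary, embedding dim, window, hidden units, tags

rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(V, d))      # character embedding matrix (lookup table)
W1  = rng.normal(scale=0.1, size=(H, w * d))  # window layer -> hidden layer
b1  = np.zeros(H)
W2  = rng.normal(scale=0.1, size=(T, H))      # hidden layer -> per-tag scores
b2  = np.zeros(T)

def char_scores(window_ids):
    """Score every tag for the character at the center of a w-sized window."""
    x = emb[window_ids].reshape(-1)   # lookup, then concatenate the w vectors
    h = np.tanh(W1 @ x + b1)          # classical nonlinear neural network layer
    return W2 @ h + b2                # one score per tag
```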
The characters are fed into the network as indices that are used by a lookup operation to transform characters into their feature vectors. We consider a fixed-sized character dictionary. The vector representations are stored in a character embedding matrix.
The lookup operation can be seen as a simple projection layer. The feature vector of each character, starting from a random initialization, can be automatically trained by backpropagation to be relevant to the task of interest.
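The "projection layer" view can be checked directly: indexing the embedding matrix is equivalent to multiplying a one-hot character vector by it. A small sketch (sizes hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
emb = rng.normal(size=(5000, 50))   # character embedding matrix

def one_hot(i, size=5000):
    v = np.zeros(size)
    v[i] = 1.0
    return v

# Indexing the table is the same as multiplying a one-hot character
# vector by the embedding matrix, i.e. a linear projection layer whose
# weights are trained by backpropagation like any other layer.
assert np.allclose(emb[42], one_hot(42) @ emb)
```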
In practice, it is common that one might want to provide other additional features that are thought to be helpful for the task. We associate a lookup table with each additional feature, and the character feature vector becomes the concatenation of the outputs of all these lookup tables.
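For instance, the concatenation of lookup outputs might look like the following sketch; the extra feature (a character-type id) and all sizes are made-up examples:

```python
import numpy as np

rng = np.random.default_rng(2)
char_emb  = rng.normal(size=(5000, 50))  # main character lookup table
extra_emb = rng.normal(size=(10, 5))     # one extra feature, e.g. a character-type id

def char_vector(char_id, extra_id):
    # The final representation is the concatenation of all lookup outputs.
    return np.concatenate([char_emb[char_id], extra_emb[extra_id]])

print(char_vector(42, 3).shape)   # (55,)
```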
For each character in a sentence, a score is produced for every tag by applying several layers of the neural network over the feature vectors produced by the lookup table layer.
We use a window approach to handle sentences of variable length. The characters with indices exceeding the sentence boundaries are mapped to one of two special symbols, namely "start" and "stop" symbols.
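A sketch of this padding scheme (the "<s>"/"</s>" symbol names are hypothetical; in practice the special symbols would just be extra rows of the character lookup table):

```python
def windows(char_ids, w=5, start="<s>", stop="</s>"):
    """Build a fixed-size window around every character of a sentence.

    Indices beyond the sentence boundaries are mapped to the special
    "start" and "stop" symbols.
    """
    k = w // 2
    padded = [start] * k + list(char_ids) + [stop] * k
    return [padded[i:i + w] for i in range(len(char_ids))]

print(windows([7, 8, 9]))
# [['<s>', '<s>', 7, 8, 9], ['<s>', 7, 8, 9, '</s>'], [7, 8, 9, '</s>', '</s>']]
```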
The tags are organized in chunks, and it is impossible for some tags to follow other tags. We introduce a transition score Aij for jumping from tag i ∈ T to tag j ∈ T in successive characters, and an initial score A0i for starting from the i-th tag, to take the sentence structure into account.
The score of a sentence c[1:n] along a path of tags t[1:n] is then given by the sum of transition and network scores.
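Written out in the notation of (Collobert et al., 2011), which the paper follows, the sentence-level score should take the form below, where f_θ is the per-character network score and the i = 1 term uses the initial score A0:

```latex
s\big(c_{[1:n]}, t_{[1:n]}, \theta\big)
  \;=\; \sum_{i=1}^{n} \Big( A_{t_{i-1}\,t_i} + f_{\theta}\big(t_i \mid c_{[1:n]}, i\big) \Big)
```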
Given a sentence c[1:n], we can find the best tag path by maximizing the sentence score. The Viterbi algorithm can be used for this inference.
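A self-contained sketch of Viterbi decoding over the network and transition scores (array names hypothetical):

```python
import numpy as np

def viterbi(scores, A, A0):
    """Find the highest-scoring tag path.

    scores: (n, T) network scores for each character and tag,
    A:      (T, T) transition scores A[i, j] for tag i -> tag j,
    A0:     (T,)   initial scores for starting in each tag.
    """
    n, T = scores.shape
    delta = A0 + scores[0]                     # best score of a path ending in each tag
    back = np.zeros((n, T), dtype=int)
    for i in range(1, n):
        cand = delta[:, None] + A + scores[i]  # (prev_tag, tag) candidate scores
        back[i] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    path = [int(delta.argmax())]               # backtrack from the best final tag
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1], float(delta.max())
```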
The network is generally trained by maximizing a log-likelihood over all the sentences in the training set with respect to its parameters, using a gradient ascent algorithm. The gradient can be computed by classical backpropagation.
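In the formulation of (Collobert et al., 2011), this sentence-level log-likelihood is the score of the correct path minus a log-sum-exp ("logadd") over all possible tag paths:

```latex
\log p\big(t_{[1:n]} \mid c_{[1:n]}, \theta\big)
  \;=\; s\big(c_{[1:n]}, t_{[1:n]}, \theta\big)
  \;-\; \underset{\forall\, t'_{[1:n]}}{\operatorname{logadd}}\; s\big(c_{[1:n]}, t'_{[1:n]}, \theta\big)
```

The logadd term is computed by a forward recursion over all |T| tags at every position, which is a large part of why this training method is expensive.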
The log-likelihood training method is computationally expensive. The paper therefore proposes a perceptron-style algorithm for training the neural networks.
Intuitively, the new training algorithm has the effect of updating the parameter values in a way that increases the score of the correct tag sequence and decreases the score of the incorrect one output by the network under the current parameter settings. If the tag sequence produced by the network is correct, no changes are made to the parameter values.
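A toy sketch of this update rule, using viterbi() from the earlier sketch. As a simplifying assumption (not the paper's model), the emission scores here come from a plain table E instead of the deep network; in the paper the same plus/minus update is backpropagated through the network layers:

```python
import numpy as np

V, T = 100, 4
rng = np.random.default_rng(3)
E  = rng.normal(scale=0.01, size=(V, T))   # toy per-character tag scores
A  = np.zeros((T, T))                      # transition scores
A0 = np.zeros(T)                           # initial scores

def perceptron_step(chars, gold, lr=0.1):
    pred, _ = viterbi(E[chars], A, A0)     # decode with current parameters
    if pred == list(gold):
        return                             # correct output: no update
    for path, sign in ((list(gold), +lr), (pred, -lr)):
        A0[path[0]] += sign
        for i, (c, t) in enumerate(zip(chars, path)):
            E[c, t] += sign                # raise gold scores, lower predicted ones
            if i > 0:
                A[path[i - 1], t] += sign
```

Only the positions where the two paths disagree end up changed, which matches the intuition above: the correct sequence is pushed up, the incorrect prediction is pushed down, and a correct prediction leaves everything untouched.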