Character-level Convolutional Networks for Text Classification: text classification with character-level convolutional neural networks, a strong baseline (NIPS 2015)
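I. Paper structure
Abstract: The paper empirically explores how effective character-level convolutional networks are for text classification; the model achieves very good classification results.
Introduction: Convolutional networks extract features effectively from raw signals such as images and speech, and character-level features are already common in natural language processing. The paper explores how effective character-level convolutional networks are on text classification.
Comparison Models: Introduces the text classification models used for comparison, including traditional bag-of-words models and deep learning models.
Character-level Convolutional Networks: Describes the character-level convolutional network in detail, together with a data augmentation method.
Large-scale Datasets and Results: Describes the text classification datasets used in the paper and the experimental results.
Discussion: Discusses the results and how certain parameters and settings affect them.
Conclusion and Outlook: Summarizes the paper and looks ahead to future work.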
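II. Goals
(i) Background
Convolutional neural networks
Character-level features
(ii) The CharTextCNN model
Model walkthrough
Data augmentation
(iii) Comparison models
Traditional models
Deep learning models
(iv) Experimental results and analysis
Dataset overview
Experimental results
Discussion of results
(v) Code implementation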
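II. Background knowledge
(i) Development of convolutional neural networks
(1) 2012: ImageNet
(2) 2014: convolutional networks used for audio feature extraction
(3) 2014: character information used to build word representations
III. Paper walkthrough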
Abstract
This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.
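1. The paper explores, from an experimental angle, how effective character-level convolutional networks are for text classification.
2. The authors build several large-scale text classification datasets; experiments show that the character-level model achieves the best or highly competitive results.
3. The comparison models include traditional bag-of-words and n-gram models with their TF-IDF variants, plus deep learning text classifiers based on convolutional and recurrent neural networks.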
1 Introduction
Overview of text classification: Text classification is a classic topic for natural language processing, in which one needs to assign predefined categories to free-text documents.
Development of text classification: The range of text classification research goes from designing the best features to choosing the best possible machine learning classifiers. To date, almost all techniques of text classification are based on words, in which simple statistics of some ordered word combinations (such as n-grams) usually perform the best [12].
Effectiveness of CNNs: On the other hand, many researchers have found convolutional networks (ConvNets) [17] [18] are useful in extracting information from raw signals, ranging from computer vision applications to speech recognition and others. In particular, time-delay networks used in the early days of deep learning research are essentially convolutional networks that model sequential data [1] [31].
Character-level CNNs for text classification (this paper): In this article we explore treating text as a kind of raw signal at character level, and applying temporal (one-dimensional) ConvNets to it. For this article we only used a classification task as a way to exemplify ConvNets' ability to understand texts. Historically we know that ConvNets usually require large-scale datasets to work, therefore we also build several of them. An extensive set of comparisons is offered with traditional models and other deep learning models.
ConvNets in natural language processing: Applying convolutional networks to text classification or natural language processing at large was explored in literature. It has been shown that ConvNets can be directly applied to distributed [6] [16] or discrete (i.e. one-hot) [13] embedding of words, without any knowledge on the syntactic or semantic structures of a language. These approaches have been proven to be competitive to traditional models.
Prior use of character-level information: There are also related works that use character-level features for language processing. These include using character-level n-grams with linear classifiers [15], and incorporating character-level features to ConvNets [28] [29]. In particular, these ConvNet approaches use words as a basis, in which character-level features extracted at word [28] or word n-gram [29] level form a distributed representation. Improvements for part-of-speech tagging and information retrieval were observed.
This article is the first to apply ConvNets only on characters. We show that when trained on large-scale datasets, deep ConvNets do not require the knowledge of words, in addition to the conclusion from previous research that ConvNets do not require the knowledge about the syntactic or semantic structure of a language.
Advantages: This simplification of engineering could be crucial for a single system that can work for different languages, since characters always constitute a necessary construct regardless of whether segmentation into words is possible. Working on only characters also has the advantage that abnormal character combinations such as misspellings and emoticons may be naturally learnt.
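1. Text classification is one of the basic tasks of natural language processing, and most current approaches are word-based.
2. Convolutional networks successfully extract features from raw signals such as images and speech, so the paper applies them to character-level data.
3. Applying convolutional networks to text is already common, and there is plenty of research on using character-level features to improve NLP tasks.
4. The paper is the first to use a purely character-level convolutional network; the authors find that it obtains good results on large corpora without any word-level information.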
2 Character-level Convolutional Networks
In this section, we introduce the design of character-level ConvNets for text classification. The design is modular, where the gradients are obtained by back-propagation [27] to perform optimization.
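Note: max-pooling is applied only within each feature, that is, along the sequence and separately for every feature channel.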
2.2 Character quantization
Our models accept a sequence of encoded characters as input. The encoding is done by prescribing an alphabet of size m for the input language, and then quantizing each character using 1-of-m encoding (or "one-hot" encoding). Then, the sequence of characters is transformed to a sequence of such m sized vectors with fixed length l0 (the cap on the number of characters). Any character exceeding length l0 is ignored, and any characters that are not in the alphabet, including blank characters, are quantized as all-zero vectors. The character quantization order is backward so that the latest reading on characters is always placed near the beginning of the output, making it easy for fully connected layers to associate weights with the latest reading. The alphabet used in all of our models consists of 70 characters, including 26 English letters, 10 digits, 33 other characters and the new line character. The non-space characters are:
abcdefghijklmnopqrstuvwxyz0123456789
-,;.!?:'"/\|_@#$%^&*~`+-=<>()[]{}
Later we also compare with models that use a different alphabet in which we distinguish between upper-case and lower-case letters.
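The quantization step above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the symbol string is my reading of the garbled alphabet line, lowercasing the input is an assumption, and so is the choice to reverse the text first and then truncate (which keeps the last l0 characters). The names `ALPHABET`, `CHAR_INDEX` and `quantize` are hypothetical.

```python
# 26 letters + 10 digits + 33 other characters + newline = 70 symbols.
# Note that '-' appears twice in the list, so one column is never used.
ALPHABET = (
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
    "\n"
)
# For the duplicated '-', the later position wins in this mapping.
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def quantize(text, l0=1014):
    """Return l0 one-hot frames of size m = len(ALPHABET), in backward order.

    Characters outside the alphabet (including spaces) and padding
    positions stay all-zero.
    """
    frames = [[0] * len(ALPHABET) for _ in range(l0)]
    # Assumption: lowercase, reverse, then keep at most l0 characters.
    reversed_chars = text.lower()[::-1][:l0]
    for pos, ch in enumerate(reversed_chars):
        col = CHAR_INDEX.get(ch)
        if col is not None:
            frames[pos][col] = 1
    return frames


frames = quantize("ab c", l0=6)
print(len(frames), len(frames[0]))
```

With `l0=6` the text "ab c" is read backward as "c", " ", "b", "a", so the first frame encodes "c", the second is all zeros for the space, and the last two frames are zero padding.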
2.3 Model Design
We designed 2 ConvNets: one large and one small. They are both 9 layers deep with 6 convolutional layers and 3 fully-connected layers. Figure 1 gives an illustration.
The input has a number of features equal to 70 due to our character quantization method, and the input feature length is 1014. It seems that 1014 characters could already capture most of the texts of interest. We also insert 2 dropout [10] modules in between the 3 fully-connected layers to regularize. They have dropout probability of 0.5. Table 1 lists the configurations for convolutional layers, and table 2 lists the configurations for fully-connected (linear) layers.
For different problems the input lengths may be different (for example in our case l0 = 1014), and so are the frame lengths. From our model design, it is easy to know that given input length l0, the output frame length after the last convolutional layer (but before any of the fully-connected layers) is l6 = (l0 − 96)/27. This number multiplied with the frame size at layer 6 will give the input dimension the first fully-connected layer accepts.
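The formula l6 = (l0 − 96)/27 can be checked by walking the frame length through the six convolutional layers. The sketch below assumes the configuration from the paper's Table 1, which is not reproduced in these notes: kernel sizes 7, 7, 3, 3, 3, 3 with stride 1 and no padding, and non-overlapping max-pooling of size 3 after layers 1, 2 and 6. The names `LAYERS` and `output_frame_length` are hypothetical.

```python
# (kernel size, pooling size or None) for the 6 convolutional layers.
# Assumed from the paper's Table 1, which these notes do not reproduce.
LAYERS = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]


def output_frame_length(l0, layers=LAYERS):
    """Frame length after the last convolutional layer for input length l0."""
    length = l0
    for kernel, pool in layers:
        length = length - kernel + 1  # stride-1 convolution without padding
        if pool:
            length //= pool           # non-overlapping max-pooling
    return length


l6 = output_frame_length(1014)
print(l6, (1014 - 96) // 27)
```

For l0 = 1014 both expressions give 34. Multiplying by the layer-6 frame size (1024 for the large model and 256 for the small one, again per the paper's Table 1) gives the input dimension of the first fully-connected layer.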
2.4 Data Augmentation using Thesaurus
Many researchers have found that appropriate data augmentation techniques are useful for controlling generalization error for deep learning models. These techniques usually work well when we could find appropriate invariance properties that the model should possess. In terms of texts, it is not reasonable to augment the data using signal transformations as done in image or speech recognition, because the exact order of characters may form rigorous syntactic and semantic meaning. Therefore, the best way to do data augmentation would have been using human rephrases of sentences, but this is unrealistic and expensive due to the large volume of samples in our datasets. As a result, the most natural choice in data augmentation for us is to replace words or phrases with their synonyms.
In short: the augmentation methods traditionally used for image or speech signals do not carry over to NLP, and the most reasonable approach is synonym replacement.
We experimented data augmentation by using an English thesaurus, which is obtained from the mytheas component used in the LibreOffice project. That thesaurus in turn was obtained from WordNet [7], where every synonym to a word or phrase is ranked by the semantic closeness to the most frequently seen meaning. To decide on how many words to replace, we extract all replaceable words from the given text and randomly choose r of them to be replaced. The probability of number r is determined by a geometric distribution with parameter p in which P[r] ∼ p^r. The index s of the synonym chosen given a word is also determined by another geometric distribution in which P[s] ∼ q^s. This way, the probability of a synonym chosen becomes smaller when it moves distant from the most frequently seen meaning. We will report the results using this new data augmentation technique with p = 0.5 and q = 0.5.
Note: the number of replaced words follows a geometric distribution. Replacing one word has probability proportional to p^1, two words p^2, and so on, so the more words are replaced, the lower the probability.
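The sampling procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the thesaurus is a toy dictionary rather than the LibreOffice one, both r and s are assumed to start at 1, and the geometric distributions are truncated at the number of replaceable words and the number of available synonyms respectively. The names `sample_geometric` and `augment` are hypothetical.

```python
import random


def sample_geometric(p, rng, max_value):
    """Sample k in 1..max_value with P[k] proportional to p**k (truncated)."""
    values = list(range(1, max_value + 1))
    weights = [p ** k for k in values]
    return rng.choices(values, weights=weights, k=1)[0]


def augment(text, thesaurus, p=0.5, q=0.5, rng=None):
    """Replace r randomly chosen replaceable words with ranked synonyms."""
    rng = rng or random.Random()
    words = text.split()
    replaceable = [i for i, w in enumerate(words) if w.lower() in thesaurus]
    if not replaceable:
        return text
    # r: how many words to replace; fewer replacements are more likely.
    r = sample_geometric(p, rng, len(replaceable))
    for i in rng.sample(replaceable, r):
        synonyms = thesaurus[words[i].lower()]  # ranked, closest meaning first
        # s: which synonym to use; closer synonyms are more likely.
        s = sample_geometric(q, rng, len(synonyms))
        words[i] = synonyms[s - 1]
    return " ".join(words)


# Toy thesaurus for illustration only.
toy_thesaurus = {"good": ["fine", "nice"], "movie": ["film"]}
print(augment("a good movie", toy_thesaurus, rng=random.Random(0)))
```

Because r is at least 1 under these assumptions, every call changes at least one replaceable word; a variant that also allows r = 0 would sometimes return the text unchanged.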
3 Comparison Models
To offer fair comparisons to competitive models, we conducted a series of experiments with both traditional and deep learning methods. We tried our best to choose models that can provide comparable and competitive results, and the results are reported faithfully without any model selection.
3.1 Traditional Methods
We refer to traditional methods as those that use a hand-crafted feature extractor and a linear classifier. The classifier used is a multinomial logistic regression in all these models.
Bag-of-words and its TFIDF. For each dataset, the bag-of-words model is constructed by selecting the 50,000 most frequent words from the training subset. For the normal bag-of-words, we use the counts of each word as the features. For the TFIDF (term-frequency inverse-document-frequency) [14] version, we use the counts as the term-frequency. The inverse document frequency is the logarithm of the division between total number of samples and number of samples with the word in the training subset. The features are normalized by dividing by the largest feature value.
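The bag-of-words TFIDF features described above can be sketched in plain Python. This is a minimal illustration under assumptions: tokens are obtained by whitespace splitting, and the normalization divides by the largest feature value within each sample (the text does not say whether the maximum is per sample or global). The names `build_vocab` and `tfidf_features` are hypothetical.

```python
import math
from collections import Counter


def build_vocab(train_docs, max_size=50000):
    """Select the most frequent words from the training subset."""
    counts = Counter(w for doc in train_docs for w in doc.split())
    return [w for w, _ in counts.most_common(max_size)]


def tfidf_features(doc, vocab, train_docs):
    """TFIDF features: raw count times log(N / document frequency)."""
    n = len(train_docs)
    doc_sets = [set(d.split()) for d in train_docs]
    tf = Counter(doc.split())
    feats = []
    for w in vocab:
        df = sum(1 for s in doc_sets if w in s)
        idf = math.log(n / df) if df else 0.0
        feats.append(tf[w] * idf)
    # Assumption: normalize by the largest feature value of this sample.
    largest = max(feats, default=0.0)
    return [f / largest for f in feats] if largest > 0 else feats


train = ["the cat sat", "the dog ran", "the cat ran"]
vocab = build_vocab(train)
print(dict(zip(vocab, tfidf_features("the cat cat", vocab, train))))
```

In this toy corpus "the" occurs in every document, so its inverse document frequency is log(3/3) = 0 and its feature vanishes, while "cat" carries all the weight. The bag-of-ngrams variant would use the same computation over n-gram tokens.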
Bag-of-ngrams and its TFIDF. The bag-of-ngrams models are constructed by selecting the 500,000 most frequent n-grams (up to 5-grams) from the training subset for each dataset. The feature values are computed the same way as in the bag-of-words model.
Bag-of-means on word embedding. We also have an experimental model that uses k-means on word2vec [23] learnt from the training subset of each dataset, and then use these learnt means as representatives of the clustered words. We take into consideration all the words that appeared more than 5 times in the training subset. The dimension of the embedding is 300. The bag-of-means features are computed the same way as in the bag-of-words model. The number of means is 5000.
3.2 Deep Learning Methods
Recently deep learning methods have started to be applied to text classification. We choose two simple and representative models for comparison, in which one is word-based ConvNet and the other a simple long-short term memory (LSTM) [11] recurrent neural network model.
Word-based ConvNets. Among the large number of recent works on word-based ConvNets for text classification, one of the differences is the choice of using pretrained or end-to-end learned word representations. We offer comparisons with both using the pretrained word2vec [23] embedding [16] and using lookup tables [5]. The embedding size is 300 in both cases, in the same way as our bag-of-means model. To ensure fair comparison, the models for each case are of the same size as our character-level ConvNets, in terms of both the number of layers and each layer's output size. Experiments using a thesaurus for data augmentation are also conducted.
Long-short term memory. We also offer a comparison with a recurrent neural network model, namely long-short term memory (LSTM) [11]. The LSTM model used in our case is word-based, using pretrained word2vec embedding of size 300 as in previous models. The model is formed by taking the mean of the outputs of all LSTM cells to form a feature vector, and then using multinomial logistic regression on this feature vector. The output dimension is 512. The variant of LSTM we used is the common "vanilla" architecture [8] [9]. We also used gradient clipping [25] in which the gradient norm is limited to 5. Figure 2 gives an illustration.
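Two small operations in the LSTM baseline, averaging the cell outputs and limiting the gradient norm to 5, can be illustrated without any deep learning framework. The sketch below works on plain Python lists and assumes the clipping rescales the whole gradient vector by its L2 norm; the paper excerpt does not spell out the exact variant. The names `mean_pool` and `clip_by_norm` are hypothetical.

```python
import math


def mean_pool(outputs):
    """Average a list of equally sized output vectors into one feature vector."""
    steps = len(outputs)
    return [sum(column) / steps for column in zip(*outputs)]


def clip_by_norm(grads, max_norm=5.0):
    """Rescale grads so that their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
print(clip_by_norm([6.0, 8.0]))
```

A gradient of norm 10 is scaled by one half so that its direction is kept and only its length changes, while a gradient already within the limit is returned unchanged.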
4 Large-scale Datasets and Results
Previous research on ConvNets in different areas has shown that they usually work well with large-scale datasets, especially when the model takes in low-level raw features like characters in our case. However, most open datasets for text classification are quite small, and large-scale datasets are split with a significantly smaller training set than testing [21]. Therefore, instead of confusing our community more by using them, we built several large-scale datasets for our experiments, ranging from hundreds of thousands to several millions of samples. Table 3 is a summary.
Table 3: Statistics of our large-scale datasets. Epoch size is the number of minibatches in one epoch.
(Topic) AG's news corpus. We obtained the AG's corpus of news articles on the web. It contains 496,835 categorized news articles from more than 2000 news sources. We choose the 4 largest classes from this corpus to construct our dataset, using only the title and description fields. The number of training samples for each class is 30,000 and testing 1900.
(Topic) Sogou news corpus. This dataset is a combination of the SogouCA and SogouCS news corpora [32], containing in total 2,909,551 news articles in various topic channels. We then labeled each piece of news using its URL, by manually classifying their domain names. This gives us a large corpus of news articles labeled with their categories. There are a large number of categories but most of them contain only few articles. We choose 5 categories: "sports", "finance", "entertainment", "automobile" and "technology". The number of training samples selected for each class is 90,000 and testing 12,000. Although this is a dataset in Chinese, we used the pypinyin package combined with the jieba Chinese segmentation system to produce Pinyin, a phonetic romanization of Chinese. The models for English can then be applied to this dataset without change. The fields used are title and content.
(Topic) DBPedia ontology dataset. DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia [19]. The DBpedia ontology dataset is constructed by picking 14 nonoverlapping classes from DBpedia 2014. From each of these 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples. The fields we used for this dataset contain title and abstract of each Wikipedia article.
(Sentiment) Yelp reviews. The Yelp reviews dataset is obtained from the Yelp Dataset Challenge in 2015. This dataset contains 1,569,264 samples that have review texts. Two classification tasks are constructed from this dataset: one predicting the full number of stars the user has given, and the other predicting a polarity label by considering stars 1 and 2 negative, and 3 and 4 positive. The full dataset has 130,000 training samples and 10,000 testing samples in each star, and the polarity dataset has 280,000 training samples and 19,000 test samples in each polarity.
(Topic) Yahoo! Answers dataset. We obtained the Yahoo! Answers Comprehensive Questions and Answers version 1.0 dataset through the Yahoo! Webscope program. The corpus contains 4,483,032 questions and their answers. We constructed a topic classification dataset from this corpus using the 10 largest main categories. Each class contains 140,000 training samples and 5,000 testing samples. The fields we used include question title, question content and best answer.
(Sentiment) Amazon reviews. We obtained an Amazon review dataset from the Stanford Network Analysis Project (SNAP), which spans 18 years with 34,686,770 reviews from 6,643,669 users on 2,441,053 products [22]. Similarly to the Yelp review dataset, we also constructed 2 datasets: one full score prediction and another polarity prediction. The full dataset contains 600,000 training samples and 130,000 testing samples in each class, whereas the polarity dataset contains 1,800,000 training samples and 200,000 testing samples in each polarity sentiment. The fields used are review title and review content.
5 Discussion
To understand the results in table 4 further, we offer some empirical analysis in this section. To facilitate our analysis, we present the relative errors in figure 3 with respect to comparison models. Each of these plots is computed by taking the difference between errors on comparison model and our character-level ConvNet model, then divided by the comparison model error. All ConvNets in the figure are the large models with thesaurus augmentation respectively.
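The relative error used in these plots is a one-line computation: the comparison model's error minus the character-level ConvNet's error, divided by the comparison model's error. The sketch below only restates that definition; the error values are made up for illustration and are not results from the paper, and the function name `relative_error` is hypothetical.

```python
def relative_error(comparison_error, char_convnet_error):
    """Relative error of a comparison model with respect to the char ConvNet.

    Positive values mean the character-level ConvNet has the lower error.
    """
    return (comparison_error - char_convnet_error) / comparison_error


# Hypothetical test errors in percent, for illustration only.
print(relative_error(10.0, 8.0))   # comparison model is worse
print(relative_error(8.0, 10.0))   # comparison model is better
```

Dividing by the comparison model's error puts datasets of different difficulty on a comparable scale, which is what lets a single figure cover all of them.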
Character-level ConvNet is an effective method. The most important conclusion from our experiments is that character-level ConvNets could work for text classification without the need for words. This is a strong indication that language could also be thought of as a signal no different from any other kind. Figure 4 shows 12 random first-layer patches learnt by one of our character-level ConvNets for DBPedia dataset.
Dataset size forms a dichotomy between traditional and ConvNets models. The most obvious trend coming from all the plots in figure 3 is that the larger datasets tend to perform better. Traditional methods like n-grams TFIDF remain strong candidates for dataset of size up to several hundreds of thousands, and only until the dataset goes to the scale of several millions do we observe that character-level ConvNets start to do better.
Better handling of noise: ConvNets may work well for user-generated data. User-generated data vary in the degree of how well the texts are curated. For example, in our million scale datasets, Amazon reviews tend to be raw user-inputs, whereas users might be extra careful in their writings on Yahoo! Answers. Plots comparing word-based deep models (figures 3c, 3d and 3e) show that character-level ConvNets work better for less curated user-generated texts. This property suggests that ConvNets may have better applicability to real-world scenarios. However, further analysis is needed to validate the hypothesis that ConvNets are truly good at identifying exotic character combinations such as misspellings and emoticons, as our experiments alone do not show any explicit evidence.
Choice of alphabet makes a difference. Figure 3f shows that changing the alphabet by distinguishing between uppercase and lowercase letters could make a difference. For million-scale datasets, it seems that not making such distinction usually works better. One possible explanation is that there is a regularization effect, but this is to be validated.
Task semantics matter little: Semantics of tasks may not matter. Our datasets consist of two kinds of tasks: sentiment analysis (Yelp and Amazon reviews) and topic classification (all others). This dichotomy in task semantics does not seem to play a role in deciding which method is better.
Bag-of-means is a misuse of word2vec [20]. One of the most obvious facts one could observe from table 4 and figure 3a is that the bag-of-means model performs worse in every case. Comparing with traditional models, this suggests such a simple use of a distributed word representation may not give us an advantage to text classification. However, our experiments do not speak for any other language processing tasks or use of word2vec in any other way.
There is no free lunch. Our experiments once again verify that there is not a single machine learning model that can work for all kinds of datasets. The factors discussed in this section could all play a role in deciding which method is the best for some specific application.
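Disadvantages:
Character-level sequences are very long, which makes long-document classification harder
Because only character-level information is used, the model learns relatively little semantic information
Performance is poor on small corpora
Advantages:
The model structure is simple
It works for many languages without any word segmentation
It performs well on noisy text, since out-of-vocabulary (OOV) problems essentially do not arise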
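IV. Research contributions and significance
Built several large text classification datasets, which have become some of the most widely used in the field
Achieved the best or highly competitive results on several datasets
Historical significance
Building multiple text classification datasets greatly advanced research on text classification
The proposed CharTextCNN uses only character information, so it can be applied to many languages
Takeaways
Text classification with convolutional networks does not need knowledge of a language's syntactic or semantic structure
The experimental results show that no single machine learning model performs best on every dataset.
The paper analyzes, from an experimental angle, how well character-level convolutional networks suit text classification.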