What is Text Classification?

Text classification typically involves assigning a document to a category by automated or human means. LingPipe provides a classification facility that takes examples of text classifications--typically generated by a human--and learns how to classify further documents using what it learned with language models. There are many other ways to construct classifiers, but language models are particularly good at some versions of this task.
A publicly available data set to work with is the 20 newsgroups data, available from the 20 Newsgroups Home Page.
We have included a sample of 4 newsgroups with the LingPipe distribution so that you can run the tutorial out of the box. You may also download the entire 20 newsgroups data set and run over that instead; LingPipe's performance over the whole data set is state of the art.
Once you have downloaded and installed LingPipe, change directories to the one containing this read-me:
cd demos/tutorial/classify
You may then run the demo from the command line (placing all of the code on one line):
On Windows:
java -cp "../../../lingpipe-4.1.0.jar;classifyNews.jar" ClassifyNews
or through Ant:
ant classifyNews
The demo will then train on the data in demos/fourNewsGroups/4news-train/ and evaluate on demos/4newsgroups/4news-test. The results of scoring are printed to the command line and explained in the rest of this tutorial.
The entire source for the example is in ClassifyNews.java. We will be using the API from Classifier and its subclasses to train the classifier, and Classification to evaluate it. The code should be pretty self-explanatory in terms of how training and evaluation are done. Below I go over the API calls.
We are going to train up a set of character-based language models (one per newsgroup, as named in the static array CATEGORIES) that process data in 6-character sequences, as specified by the NGRAM_SIZE constant.
private static String[] CATEGORIES
    = { "soc.religion.christian",
        "talk.religion.misc",
        "alt.atheism",
        "misc.forsale" };

private static int NGRAM_SIZE = 6;
Generally, the smaller your data, the smaller the n-gram size should be, but you can play around with different values; reasonable sizes range from 1 to 16, with 6 being a good general starting place.
The actual classifier involves one language model per category. In this case, we are going to use process language models (LanguageModel.Process). There is a factory method in DynamicLMClassifier to construct the actual models.
DynamicLMClassifier classifier
    = DynamicLMClassifier.createNGramBoundary(CATEGORIES, NGRAM_SIZE);
There are two other kinds of language model classifiers that may be constructed, for bounded character language models and tokenized language models.
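For comparison, here is a minimal sketch of how the three variants might be constructed. The factory names createNGramProcess and createTokenized and the IndoEuropeanTokenizerFactory.INSTANCE singleton are assumptions about the 4.1 API rather than code taken from the demo, so check the DynamicLMClassifier Javadoc before relying on them:

// boundary character LMs: model begin/end of text explicitly (used by the demo above)
DynamicLMClassifier boundaryClassifier
    = DynamicLMClassifier.createNGramBoundary(CATEGORIES, NGRAM_SIZE);

// process character LMs: no explicit begin/end-of-text modeling (assumed factory name)
DynamicLMClassifier processClassifier
    = DynamicLMClassifier.createNGramProcess(CATEGORIES, NGRAM_SIZE);

// tokenized LMs: n-grams over tokens rather than characters
// (assumed factory signature and tokenizer factory singleton)
DynamicLMClassifier tokenClassifier
    = DynamicLMClassifier.createTokenized(CATEGORIES,
                                          IndoEuropeanTokenizerFactory.INSTANCE,
                                          NGRAM_SIZE);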
Training a classifier simply involves providing examples of text for the various categories. This is done through the handle method, after first constructing a classification from the category and a classified object from the classification and text:
Classification classification
    = new Classification(CATEGORIES[i]);
Classified<CharSequence> classified
    = new Classified<CharSequence>(text, classification);
classifier.handle(classified);
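In the demo this training call sits inside a loop over the training directories. The following is only a rough sketch of what such a loop might look like, assuming one subdirectory per category under 4news-train, ISO-8859-1 encoded files, and plain java.nio file reading rather than whatever file utilities ClassifyNews.java itself uses (the enclosing method is assumed to declare throws IOException):

File trainDir = new File("demos/fourNewsGroups/4news-train");
for (int i = 0; i < CATEGORIES.length; ++i) {
    Classification classification = new Classification(CATEGORIES[i]);
    File catDir = new File(trainDir, CATEGORIES[i]);          // one directory per category (assumed layout)
    for (File articleFile : catDir.listFiles()) {
        // read the raw article text; encoding is an assumption about the corpus
        String text = new String(
            java.nio.file.Files.readAllBytes(articleFile.toPath()),
            java.nio.charset.StandardCharsets.ISO_8859_1);
        Classified<CharSequence> classified
            = new Classified<CharSequence>(text, classification);
        classifier.handle(classified);                         // train on this example
    }
}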
That's all you need to train up a language model classifier. Now we can see what it can do with some evaluation data.
The DynamicLMClassifier is pretty slow when doing classification, so it is generally worth going through a compile step to produce a more efficient compiled version, which will classify character sequences into joint classification results. A simple way to do that in code is:
JointClassifier<CharSequence> compiledClassifier
    = (JointClassifier<CharSequence>)
      AbstractExternalizable.compile(classifier);
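If you want to keep the compiled model around between runs rather than compiling it in memory each time, the compiled form can also be written to disk and read back. This is a sketch under the assumption that the AbstractExternalizable.compileTo and readObject helpers behave as their names suggest; the file name is made up for illustration:

// write the compiled form of the trained classifier to a file (assumed helper)
File modelFile = new File("newsClassifier.model");
AbstractExternalizable.compileTo(classifier, modelFile);

// later, or in another JVM: read the compiled classifier back in (assumed helper)
JointClassifier<CharSequence> loadedClassifier
    = (JointClassifier<CharSequence>)
      AbstractExternalizable.readObject(modelFile);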
Now the rubber hits the road and we can see how well the machine learning is doing. The example code both reports classifications to the console and evaluates the performance. The crucial lines of code are:
JointClassification jc = compiledClassifier.classifyJoint(text);
String bestCategory = jc.bestCategory();
String details = jc.toString();
The text is an article that was not trained on, and the JointClassification is the result of evaluating the text against all the language models. It contains a bestCategory() method that returns the name of the highest-scoring language model for the text. Just to be sure that some statistics are involved, the toString() method dumps out all the results, and they are presented as:
Testing on soc.religion.christian/21417
Best Cat: soc.religion.christian
Rank  Cat                        Score   P(Cat|In)   log2 P(Cat,In)
0=soc.religion.christian         -1.56   0.45        -1.56
1=talk.religion.misc             -2.68   0.20        -2.68
2=alt.atheism                    -2.70   0.20        -2.70
3=misc.forsale                   -3.25   0.13        -3.25
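If you want these numbers programmatically rather than as a formatted dump, the joint classification also exposes the ranked results directly. A small sketch, assuming the rank-indexed accessors size(), category(int), score(int), conditionalProbability(int) and jointLog2Probability(int) from the classification hierarchy:

// walk the categories from best (rank 0) to worst, printing the same figures as toString()
for (int rank = 0; rank < jc.size(); ++rank) {
    System.out.println(rank
        + " category=" + jc.category(rank)
        + " score=" + jc.score(rank)
        + " P(cat|in)=" + jc.conditionalProbability(rank)
        + " log2 P(cat,in)=" + jc.jointLog2Probability(rank));
}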
The remaining API of note is how the system is scored against a gold standard, in this case our test data. Since we know which newsgroup each article came from, we can evaluate how well the software is doing with the JointClassifierEvaluator class.
boolean storeInputs = true;
JointClassifierEvaluator<CharSequence> evaluator
    = new JointClassifierEvaluator<CharSequence>(compiledClassifier,
                                                 CATEGORIES,
                                                 storeInputs);
This class wraps the compiledClassifier in an evaluation framework that provides very rich reporting of how well the system is doing. Later in the code it is populated with data points through its handle method, after first constructing a classified object just as for training:
Classification classification
    = new Classification(CATEGORIES[i]);
Classified<CharSequence> classified
    = new Classified<CharSequence>(text, classification);
evaluator.handle(classified);
This will get a JointClassification for the text and then keep track of the results for reporting later. After all the data has been run, many methods exist for seeing how well the software did. In the demo code we just print the total accuracy via the ConfusionMatrix class, but it is well worth looking at the relevant Javadoc for what reporting is available.
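For instance, a one-line sketch of how total accuracy might be pulled out of the evaluator, assuming a confusionMatrix() accessor on the evaluator and a totalAccuracy() method on ConfusionMatrix:

// overall fraction of test cases assigned to the correct category (assumed accessors)
double accuracy = evaluator.confusionMatrix().totalAccuracy();
System.out.println("Total accuracy: " + accuracy);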
There's an ant target, crossValidateNews, which cross-validates the news classifier over 10 folds. Here's what a run looks like:
> cd $LINGPIPE/demos/tutorial/classify
> ant crossValidateNews
Reading data.
Num instances=250.
Permuting corpus.
FOLD    ACCU
0       1.00 +/- 0.00
1       0.96 +/- 0.08
2       0.84 +/- 0.14
3       0.92 +/- 0.11
4       1.00 +/- 0.00
5       0.96 +/- 0.08
6       0.88 +/- 0.13
7       0.84 +/- 0.14
8       0.88 +/- 0.13
9       0.84 +/- 0.14
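Under the hood, this kind of cross-validation is usually driven by a corpus that can be partitioned into folds. The following is only a rough sketch of the idea, assuming LingPipe's XValidatingObjectCorpus class with numFolds/setFold/permuteCorpus/visitTrain/visitTest methods; see CrossValidateNews.java and its Javadoc for the demo's actual code:

// corpus that holds every labeled article and can slice it into 10 folds (assumed API)
XValidatingObjectCorpus<Classified<CharSequence>> corpus
    = new XValidatingObjectCorpus<Classified<CharSequence>>(10);
// ... add every Classified<CharSequence> with corpus.handle(classified) ...
corpus.permuteCorpus(new java.util.Random(42));

for (int fold = 0; fold < corpus.numFolds(); ++fold) {
    corpus.setFold(fold);
    DynamicLMClassifier classifier
        = DynamicLMClassifier.createNGramBoundary(CATEGORIES, NGRAM_SIZE);
    corpus.visitTrain(classifier);    // train on the nine training folds
    JointClassifier<CharSequence> compiled
        = (JointClassifier<CharSequence>) AbstractExternalizable.compile(classifier);
    JointClassifierEvaluator<CharSequence> evaluator
        = new JointClassifierEvaluator<CharSequence>(compiled, CATEGORIES, false);
    corpus.visitTest(evaluator);      // evaluate on the held-out fold
    System.out.println(fold + " " + evaluator.confusionMatrix().totalAccuracy());
}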
Source: http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html
My English is pretty poor, so the rest of this is going to be a struggle.