Problems encountered when using stanford-classifier, and how to solve them


Author: 金良 ([email protected]) CSDN blog: http://blog.csdn.net/u012176591

The Stanford Classifier is a general purpose classifier - something that takes a set of input data and assigns each of them to one of a set of categories. It does this by generating features from each datum which are associated with positive or negative numeric “votes” (weights) for each class. In principle, the weights could be set by hand, but the expected use is for the weights to be learned automatically based on hand-classified training data items. (This is referred to as “supervised learning”.) The classifier can work with (scaled) real-valued and categorical inputs, and supports several machine learning algorithms. It also supports several forms of regularization, which is generally needed when building models with very large numbers of predictive features.

You can use the classifier on any sort of data, including standard statistics and machine learning data sets. But for small data sets and numeric predictors, you’d generally be better off using another tool such as R, Weka or scikit-learn. Where the Stanford Classifier shines is in working with mainly textual data, where it has powerful and flexible means of generating features from character strings. However, if you’ve also got a few numeric variables, you can throw them in at the same time.

1. Getting started with stanford-classifier

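stanford-classifier is open-source software that implements a maximum entropy classifier. Its home page, http://nlp.stanford.edu/software/classifier.shtml, provides a download link. Download the archive and unpack it.
[Screenshot: contents of the unpacked directory]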

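In the unpacked directory, stanford-classifier.jar is the executable jar file. Unpacking stanford-classifier-3.5.2-sources.jar yields the Java source code, which can be used as a base for further development.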

The page http://www-nlp.stanford.edu/wiki/Software/Classifier walks through several worked examples, which make good practice material for getting started.

Note that Java 1.8 is required; Java 1.7 will not work.
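A quick way to confirm which Java version will actually run the jar is to ask the JVM itself. The following is a minimal sketch and not part of the Stanford distribution; the class name JavaVersionCheck and the helper isAtLeastJava8 are made up for illustration. It relies on the standard system property java.specification.version, which reads "1.7" or "1.8" on older JVMs and "9", "11" and so on for newer ones.

```java
public class JavaVersionCheck {

    // Hypothetical helper: returns true if a java.specification.version
    // string such as "1.7", "1.8" or "11" denotes Java 8 or later.
    public static boolean isAtLeastJava8(String specVersion) {
        String[] parts = specVersion.trim().split("\\.");
        int major = Integer.parseInt(parts[0]);
        // Old scheme "1.x": the real version number is the second component.
        if (major == 1 && parts.length > 1) {
            major = Integer.parseInt(parts[1]);
        }
        return major >= 8;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println("java.specification.version = " + spec);
        System.out.println(isAtLeastJava8(spec)
                ? "OK: Java 8 or later"
                : "Too old: Java 1.8 is required");
    }
}
```

Running java -version on the command line gives the same information; the sketch is only useful when the check needs to happen inside a program.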

2. Developing on the source with Eclipse

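Eclipse is a convenient tool for working on the source. After copying the source code into an Eclipse project, two Java files were flagged with errors; the offending lines turned out to contain garbled Chinese characters. The fix is to open those files in Notepad and save them with ANSI encoding. The resulting Eclipse project looks like this:

[Screenshot: the Eclipse project layout]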

The main function is in ColumnDataClassifier.java, in the package edu.stanford.nlp.classify.

Export a runnable jar file from Eclipse; it is run in the same way as the executable jar that ships in the download.


3. Building the jar file with the Makefile

The unpacked directory contains a Makefile, which the make command picks up by default. Its contents are as follows:

# This is a rudimentary Makefile for rebuilding the classifier.
# We actually use ant (q.v.) or a Java IDE.

JAVAC = javac
JAVAFLAGS = -O -d classes

classifier:
	mkdir -p classes
	$(JAVAC) $(JAVAFLAGS) src/edu/stanford/nlp/*/*.java src/edu/stanford/nlp/*/*/*.java src/edu/stanford/nlp/*/*/*/*.java
	cd classes ; jar -cfm ../stanford-classifier-new.jar ../src/edu/stanford/nlp/classify/classifier-manifest.txt edu ; cd ..
	cp stanford-classifier-new.jar stanford-classifier.jar
	rm -rf classes

From the command line, change into the directory containing this file and run the following command to build:

make

Explanation of the Makefile contents:

  • 1. $(JAVAC) $(JAVAFLAGS) src/edu/stanford/nlp/*/*.java src/edu/stanford/nlp/*/*/*.java src/edu/stanford/nlp/*/*/*/*.java
    $(JAVAC) is javac, which compiles the Java files into class files. $(JAVAFLAGS) is -O -d classes, which places the generated class files in the classes folder. src/edu/stanford/nlp/*/*.java, src/edu/stanford/nlp/*/*/*.java and src/edu/stanford/nlp/*/*/*/*.java are the locations of the Java files to compile. Three patterns are needed because the Java sources are nested at three different depths, as the Eclipse project in section 2 shows.
  • 2. jar -cfm ../stanford-classifier-new.jar ../src/edu/stanford/nlp/classify/classifier-manifest.txt edu ;
    This command packages the class files into an executable jar (stanford-classifier-new.jar). The file classifier-manifest.txt tells jar where the main class is. It contains a single line, Main-class: edu.stanford.nlp.classify.ColumnDataClassifier, because the project's main function lives in ColumnDataClassifier.java.
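To see why three glob patterns are needed, one can count the .java files at each directory depth below src/edu/stanford/nlp. The sketch below is illustrative only: the class SourceDepths and the method countByDepth are hypothetical names, and the default source root is assumed from the Makefile. A depth of 2 corresponds to */*.java, 3 to */*/*.java, and 4 to */*/*/*.java.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class SourceDepths {

    // Counts .java files by the number of path components below root,
    // e.g. root/classify/Foo.java has depth 2.
    public static Map<Integer, Integer> countByDepth(Path root) throws IOException {
        Map<Integer, Integer> counts = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().endsWith(".java"))
                 .forEach(p -> counts.merge(root.relativize(p).getNameCount(), 1, Integer::sum));
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Source root assumed from the Makefile; pass another path to override.
        Path root = Paths.get(args.length > 0 ? args[0] : "src/edu/stanford/nlp");
        countByDepth(root).forEach((depth, n) ->
                System.out.println("depth " + depth + ": " + n + " .java files"));
    }
}
```

If the output lists a depth that none of the three patterns covers, the compile command would need a further pattern for it.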

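None of the three input arguments to the compile command (src/edu/stanford/nlp/*/*.java, src/edu/stanford/nlp/*/*/*.java and src/edu/stanford/nlp/*/*/*/*.java) can be left out. Otherwise the compiler cannot find some of the classes and fails with an error like this:
[Screenshot: javac error output]
The error reports that the package edu.stanford.nlp.trees.tregex cannot be found precisely because the compile command was not given all of its arguments.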

If the classifier-manifest.txt file is missing, the jar step fails with an error like this:
[Screenshot: jar error output]
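Before running the jar step, it can help to check that the manifest file exists and names a main class. The sketch below uses the standard java.util.jar.Manifest parser; the class name ManifestCheck and the method mainClassOf are hypothetical, and the default path is taken from the Makefile. The jar tool documentation warns that the manifest text must end with a newline, so the file's single line should be terminated.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.jar.Manifest;

public class ManifestCheck {

    // Returns the Main-Class value from manifest text, or null if absent.
    public static String mainClassOf(InputStream in) throws IOException {
        Manifest mf = new Manifest(in);
        // Attribute names are case-insensitive, so "Main-class" also matches.
        return mf.getMainAttributes().getValue("Main-Class");
    }

    public static void main(String[] args) throws IOException {
        // Default path assumed from the Makefile; a missing file raises
        // NoSuchFileException here instead of failing later inside jar.
        String path = args.length > 0 ? args[0]
                : "src/edu/stanford/nlp/classify/classifier-manifest.txt";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            System.out.println("Main-Class: " + mainClassOf(in));
        }
    }
}
```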

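Some Java files in this project are encoded as UTF-8 (on inspection, these are the files that contain Chinese text), and others are encoded as Unicode. Both encodings cause errors when compiling from the command line.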


The fix is to open these files in Notepad and, without changing their contents, save them with ANSI encoding.
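As an alternative to re-saving each file by hand, the source files could be normalized programmatically. The sketch below is an illustration under stated assumptions, not the author's method: it assumes the problem files begin with a byte-order mark (EF BB BF for UTF-8, FF FE for UTF-16 little-endian, which is what Notepad labels "Unicode"), decodes them accordingly, and drops the mark. Files without a mark are assumed to be UTF-8, which may not hold for every file. The names BomStripper and decode are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BomStripper {

    // Decodes source bytes, honoring a UTF-8 or UTF-16 byte-order mark.
    // Bytes without a mark are ASSUMED to be UTF-8 (an illustrative choice).
    public static String decode(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return new String(b, 3, b.length - 3, StandardCharsets.UTF_8);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return new String(b, 2, b.length - 2, StandardCharsets.UTF_16LE);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return new String(b, 2, b.length - 2, StandardCharsets.UTF_16BE);
        }
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Rewrites each file named on the command line as UTF-8 without a mark.
        for (String name : args) {
            Path p = Paths.get(name);
            String text = decode(Files.readAllBytes(p));
            Files.write(p, text.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Files rewritten this way are UTF-8, so javac would then need to be told the encoding, for example by adding -encoding UTF-8 to JAVAFLAGS. Work on a copy of the sources, since the sketch overwrites files in place.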
