Some tips on feature preprocessing with Weka
First, two links containing the full original material:
http://weka.wikispaces.com/Text+categorization+with+Weka
http://weka.wikispaces.com/ARFF+files+from+Text+Collections
Weka can read data in from a directory structure. Below are a few things to watch out for when preparing text features in Weka.
One caveat up front: loading data from a directory is not available in the Weka GUI; it has to be done from the command line or the API.
First, arrange the data to be processed in a directory structure like this:
...
 +- text_example
     +- class1
     |    +- file1.txt
     |    +- file2.txt
     |    ...
     +- class2
     |    +- another_file1.txt
     |    +- another_file2.txt
     |    ...
Then, with the Weka jar (or source tree) on the classpath, run from the command line:

java weka.core.converters.TextDirectoryLoader -dir text_example > text_example.arff
Here text_example is the directory containing the data, and the redirected output is the generated ARFF file. One point worth adding: in the resulting ARFF, the text content is stored as a single string attribute, so the generated file has just one feature plus a class label, where the class label is the name of the classX subdirectory under text_example. To make this usable, Weka provides an unsupervised attribute filter, StringToWordVector, that tokenizes the string (for English, essentially a split) into word features; it can also compute TF/IDF weights.
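As a sketch of how to turn on the TF/IDF weighting mentioned above (these setters exist on StringToWordVector in current Weka releases, but check your version's Javadoc; the wordsToKeep value here is just an illustrative choice):

```java
import weka.filters.unsupervised.attribute.StringToWordVector;

// Configure StringToWordVector to emit TF*IDF weights instead of 0/1 presence.
StringToWordVector filter = new StringToWordVector();
filter.setOutputWordCounts(true);   // use term counts rather than binary presence
filter.setTFTransform(true);        // apply the log TF transform
filter.setIDFTransform(true);       // scale by inverse document frequency
filter.setLowerCaseTokens(true);    // normalize case before counting
filter.setWordsToKeep(1000);        // keep roughly the top 1000 words per class
```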
The following short program performs a complete classification:
import weka.core.*;
import weka.core.converters.*;
import weka.classifiers.trees.*;
import weka.filters.*;
import weka.filters.unsupervised.attribute.*;

import java.io.*;

/**
 * Example class that converts HTML files stored in a directory structure into
 * an ARFF file using the TextDirectoryLoader converter. It then applies the
 * StringToWordVector filter to the data and feeds a J48 classifier with it.
 *
 * @author FracPete (fracpete at waikato dot ac dot nz)
 */
public class TextCategorizationTest {

  /**
   * Loads the text files from the ./text_example directory. In that
   * directory, each sub-directory represents a class and the text files
   * in these sub-directories will be labeled as such.
   *
   * @param args the commandline arguments
   * @throws Exception if something goes wrong
   */
  public static void main(String[] args) throws Exception {
    // convert the directory into a dataset
    TextDirectoryLoader loader = new TextDirectoryLoader();
    loader.setDirectory(new File("./text_example"));
    Instances dataRaw = loader.getDataSet();
    System.out.println("\n\nImported data:\n\n" + dataRaw.numClasses());

    // apply the StringToWordVector
    // (see the source code of the filter's setOptions(String[]) method
    // if you want to know which command-line option corresponds to which
    // bean property)
    StringToWordVector filter = new StringToWordVector();
    filter.setInputFormat(dataRaw);
    Instances dataFiltered = Filter.useFilter(dataRaw, filter);
    System.out.println("\n\nFiltered data:\n\n" + dataFiltered);

    // train J48 and output model
    J48 classifier = new J48();
    classifier.buildClassifier(dataFiltered);
    System.out.println("\n\nClassifier model:\n\n" + classifier);
  }
}
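The program above only trains and prints the model. To also get an accuracy estimate, one common pattern (a sketch using the standard Weka API; dataRaw is the dataset loaded above) is to wrap the filter and classifier in a FilteredClassifier and cross-validate, so the word dictionary is rebuilt from each training fold rather than leaking from the test fold:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Bundle StringToWordVector + J48 so the filter is fitted per training fold.
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(new StringToWordVector());
fc.setClassifier(new J48());

// 10-fold cross-validation on the raw (still-string) data.
Evaluation eval = new Evaluation(dataRaw);
eval.crossValidateModel(fc, dataRaw, 10, new Random(1));
System.out.println(eval.toSummaryString());
```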
Finally, I still recommend writing your own programs for data modeling and generation; only your own code gives precise control over data preparation. At most, Weka helps with the selection and classification steps.
One more note: many readers have asked how to do text classification. If you don't feel like reading the papers, here is the basic point: whatever the classification task, the classifiers themselves are essentially interchangeable (essentially!). What matters is how you build the model and generate the features. As for the features used in text classification, whether TF*IDF or others such as mutual information, the chi-square statistic, or expected cross entropy, the formulas are all published and genuinely easy to compute. Of the classification problems I have worked on, feature computation for text classification is among the easiest.
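To back up the claim that these weights are easy to compute, here is a minimal sketch of the classic TF*IDF formula (one of several variants; the log base and smoothing differ across papers, and the class and method names are mine):

```java
// Classic TF*IDF: relative term frequency in a document, multiplied by
// the log of (total documents / documents containing the term).
public class TfIdf {
    public static double tfidf(int termCountInDoc, int docLength,
                               int numDocs, int docsContainingTerm) {
        double tf = (double) termCountInDoc / docLength;
        double idf = Math.log((double) numDocs / docsContainingTerm);
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term appearing 3 times in a 100-word document,
        // found in 10 of the 1000 documents in the corpus.
        System.out.println(tfidf(3, 100, 1000, 10));
    }
}
```

The rarer a term is across the corpus, the larger the idf factor, so corpus-wide common words are pushed toward zero weight.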