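Sample data is hard to find online, especially multi-class data, and what is out there tends to be too small to be any fun.
Fine, I'll just make my own data, with rules of my own choosing.
Generating the data myself and writing it out as an ARFF file: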
static private void genArffData(String arffPath, int numRows, int numFields, int numClasses) throws FileNotFoundException {
    // Generate random data with n+1 fields, for multi-class classification
    Random random = new Random(Calendar.getInstance().getTimeInMillis());
    File arff = new File(arffPath);
    PrintWriter writer = new PrintWriter(new BufferedOutputStream(new FileOutputStream(arff)));
    writer.println("@RELATION \"LogisticRegression FakeData\"");
    writer.println();
    int i=0;
    // numeric attributes A, B, C, ...
    for (; i<numFields; i++) {
        writer.println("@ATTRIBUTE " + (char)('A'+i) + " REAL");
    }
    // nominal class attribute with labels 0..numClasses-1
    writer.print("@ATTRIBUTE " + (char)('A'+i) + " {");
    for (i=0; i<numClasses; i++) {
        if (i>0) writer.print(',');
        writer.print((char)('0'+i));
    }
    writer.println('}');
    writer.println();
    writer.println("@DATA");
    float [] values = new float[numFields];
    for (i=0; i<numRows; i++) {
        for (int j=0; j<numFields; j++) {
            values[j] = random.nextFloat(); // uniform in [0, 1)
            writer.print(values[j]);
            writer.print(',');
        }
        int classValue = computeClass(values, numClasses);
        writer.println(classValue);
    }
    writer.close();
}
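This code does nothing more than open a file and write content into it...
The key part is the computeClass function, where I define the rule for how the data gets classified. I threw in all sorts of functions (after all these years of Java, this is the first time I have actually looked at what is in Math... embarrassing).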
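As a side note, the header-writing part can be pulled out into a function that returns a string, which makes it easy to check without touching the disk. This is only a minimal sketch: the class name `ArffHeaderSketch` and the method `buildHeader` are made up for illustration and are not part of the original code. It also writes the class labels with `Integer.toString` instead of `(char)('0'+i)`, an assumption on my part so that more than 10 classes would still produce valid labels.

```java
import java.util.StringJoiner;

final class ArffHeaderSketch {
    // Builds an ARFF header with numFields REAL attributes (A, B, C, ...)
    // followed by one nominal class attribute with labels 0..numClasses-1.
    // Hypothetical helper, not part of the original post.
    static String buildHeader(String relation, int numFields, int numClasses) {
        StringBuilder sb = new StringBuilder();
        sb.append("@RELATION \"").append(relation).append("\"\n\n");
        for (int i = 0; i < numFields; i++) {
            sb.append("@ATTRIBUTE ").append((char) ('A' + i)).append(" REAL\n");
        }
        // Class labels as decimal strings, joined as {0,1,2,...}
        StringJoiner labels = new StringJoiner(",", "{", "}");
        for (int c = 0; c < numClasses; c++) {
            labels.add(Integer.toString(c));
        }
        sb.append("@ATTRIBUTE ").append((char) ('A' + numFields))
          .append(' ').append(labels).append("\n\n");
        sb.append("@DATA\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildHeader("LogisticRegression FakeData", 6, 4));
    }
}
```

The data rows would then be appended after this header exactly as in genArffData.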
private static int computeClass(float[] values, int numClasses) {
    // Combine the fields into one score; each of the first few fields goes through a different function
    float cv = values[0];
    for(int i=1; i<values.length; i++) {
        switch (i) {
        case 1:
            cv += values[i] * 5;
            break;
        case 2:
            // log10 of a value in [0, 1) is negative
            cv += java.lang.Math.log10(values[i]);
            break;
        case 3:
            cv += java.lang.Math.asin(values[i]);
            break;
        case 4:
            cv += java.lang.Math.exp(values[i]);
            break;
        default:
            cv += values[i]*i;
            break;
        }
    }
    // Map the score to a class index in 0..numClasses-1
    int c;
    if (cv<3) {
        c = 0;
    }
    else if (cv > (numClasses+3)) {
        c = numClasses-1;
    }
    else {
        c = ((int) ((cv)*1.5) / numClasses);
        if (c >= numClasses)
            c = numClasses-1;
    }
    return c;
}
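To see what the last part of computeClass does with a score, here is a small sketch that isolates just the thresholding step. `BucketSketch` and `bucket` are names made up for illustration. The arithmetic is copied from the function above: cast `cv*1.5` to int first, then integer-divide by numClasses.

```java
final class BucketSketch {
    // Same thresholds as the tail of computeClass, applied to an already-computed score.
    static int bucket(float cv, int numClasses) {
        if (cv < 3) {
            return 0;
        }
        if (cv > numClasses + 3) {
            return numClasses - 1;
        }
        int c = ((int) (cv * 1.5)) / numClasses; // truncate, then integer division
        return Math.min(c, numClasses - 1);
    }

    public static void main(String[] args) {
        float[] scores = {2.0f, 4.0f, 6.0f, 8.0f};
        for (float s : scores) {
            System.out.println(s + " -> class " + bucket(s, 4));
        }
    }
}
```

With numClasses = 4, scores below 3 land in class 0, scores above 7 in class 3, and the range in between splits into classes 1 and 2 where `(int)(cv*1.5)` reaches 8, i.e. around cv = 5.33.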
OK, let's drop it into main and play with it. How about 100,000 rows:
public static void main(String[] args) throws Exception {
    final String arffFilePath = "data/LogisticRegressionFakeData.arff";
    genArffData(arffFilePath, 100000, 6, 4);
    Logistic logic = trainModel(arffFilePath, 6);
    ArffLoader loader = new ArffLoader();
    File inputFile = new File(arffFilePath); // test data file (the same file used for training)
    loader.setFile(inputFile);
    Instances insTest = loader.getDataSet(); // load the test file
    insTest.setClassIndex(6); // index of the class attribute (0-based); insTest.numAttributes() returns the total number of attributes
    double sum = insTest.numInstances(); // number of test instances
    double right=0.0f;
    for(int i=0;i<sum;i++) {
        Instance ins = insTest.instance(i);
        if(logic.classifyInstance(ins)==ins.classValue()) {
            right++; // count one more correct prediction
            System.out.println("No.\t" + i + "\t" + ins.classValue() + " RIGHT");
        }
        else {
            System.out.println("No.\t" + i + "\t" + ins.classValue() + " WRONG");
        }
    }
    System.out.println("classification precision:" + (right/sum));
}
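Printing RIGHT/WRONG for 100,000 rows is a lot to scroll through. One alternative is to tally a confusion matrix and print only the summary. The sketch below uses plain Java and made-up names (`ConfusionSketch`, `confusion`, `accuracy`). It assumes the predicted and actual class values have already been cast to int, since in Weka both `classifyInstance` and `classValue` return a double class index. The ratio right/sum printed by main is what is usually called accuracy.

```java
final class ConfusionSketch {
    // m[a][p] counts instances whose actual class is a and predicted class is p.
    static int[][] confusion(int[] actual, int[] predicted, int numClasses) {
        int[][] m = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) {
            m[actual[i]][predicted[i]]++;
        }
        return m;
    }

    // Fraction of instances on the diagonal, i.e. predicted == actual.
    static double accuracy(int[][] m) {
        long correct = 0;
        long total = 0;
        for (int a = 0; a < m.length; a++) {
            for (int p = 0; p < m[a].length; p++) {
                total += m[a][p];
                if (a == p) {
                    correct += m[a][p];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        int[] actual = {0, 1, 2, 2};
        int[] predicted = {0, 1, 1, 2};
        int[][] m = confusion(actual, predicted, 3);
        System.out.println("accuracy: " + accuracy(m));
    }
}
```

Rows with large off-diagonal counts show which classes the model mixes up, which a single overall number hides.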
The generated data:
@RELATION "LogisticRegression FakeData"
@ATTRIBUTE A REAL
@ATTRIBUTE B REAL
@ATTRIBUTE C REAL
@ATTRIBUTE D REAL
@ATTRIBUTE E REAL
@ATTRIBUTE F REAL
@ATTRIBUTE G {0,1,2,3}
@DATA
0.71897244,0.32674688,0.34844375,0.14773273,0.60203516,0.030885875,1
0.87727785,0.26676136,0.9318922,0.50508565,0.22496736,0.39517665,2
0.44499284,0.5905153,0.7953741,0.05966431,0.13777435,0.106003165,1
0.37487888,0.8418185,0.33143914,0.6179532,0.39359564,0.96861655,3
0.047727704,0.23949718,0.58549887,0.53503656,0.83233106,0.5622865,2
0.70024496,0.43123567,0.18669724,0.20847279,0.17981762,0.79000807,3
0.5998019,0.39879912,0.83340144,0.5890504,0.70057064,0.049901605,2
0.6422481,0.31674922,0.18628752,0.6275924,0.66154146,0.54778665,2
0.09535301,0.63388544,0.20779681,0.16196364,0.37264192,0.73777825,3
……
The result of the run:
classification precision:0.9487
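A look at the value distributions in the Weka tool (looks pretty? Of course it does, I tuned it...):
With made-up data, the trained model really does come out rather well... What if I tweak the class-generation rule to be simpler, without functions like log and asin: could it reach 100% accuracy?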
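One possible way to try that out is to swap in a rule where the score is a plain weighted sum of the fields. The sketch below is only an assumption about what "simpler" could mean: `LinearRuleSketch` and `computeClassLinear` are names made up here, the weights reuse the `values[i]*i` pattern from the default branch of computeClass, and the thresholds are unchanged. Whether the model then actually reaches 100% is something to find out by running it.

```java
final class LinearRuleSketch {
    // Linear-only variant of computeClass: no log10, asin or exp.
    // Hypothetical rule for experimentation, not taken from the original code.
    static int computeClassLinear(float[] values, int numClasses) {
        float cv = values[0];
        for (int i = 1; i < values.length; i++) {
            cv += values[i] * i; // weight each field by its index
        }
        if (cv < 3) {
            return 0;
        }
        if (cv > numClasses + 3) {
            return numClasses - 1;
        }
        int c = ((int) (cv * 1.5)) / numClasses;
        return Math.min(c, numClasses - 1);
    }

    public static void main(String[] args) {
        float[] row = {0f, 0f, 0f, 0f, 1f, 0.5f};
        System.out.println("class " + computeClassLinear(row, 4));
    }
}
```

To use it, genArffData would call this function in place of computeClass when writing each row.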