Integrating Weka with Java for Logistic Regression (Continued)

Sample data is hard to find online, especially multi-class data; and the datasets that do turn up tend to be too small to be any fun.

Fine, I'll make up my own data, with rules of my own choosing.

Generate the data and write it out as an ARFF file.

    static private void genArffData(String arffPath, int numRows, int numFields, int numClasses) throws FileNotFoundException {

        // Generate random data with n+1 fields (n numeric features plus the class), for multi-class classification

        Random random = new Random(Calendar.getInstance().getTimeInMillis());

        File arff = new File(arffPath);
        PrintWriter writer = new PrintWriter(new BufferedOutputStream(new FileOutputStream(arff)));

        writer.println("@RELATION \"LogisticRegression FakeData\"");
        writer.println();

        int i=0;
        for (; i<numFields; i++) {
            writer.println("@ATTRIBUTE " + (char)('A'+i) + " REAL");
        }
        writer.print("@ATTRIBUTE " + (char)('A'+i) + " {");
        for (i=0; i<numClasses; i++) {
            if (i>0) writer.print(',');
            writer.print((char)('0'+i));
        }
        writer.println('}');
        writer.println();

        writer.println("@DATA");

        float [] values = new float[numFields];
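        for (i=0; i<numRows; i++) {
            for (int j=0; j<numFields; j++) {
                values[j] = random.nextFloat(); // uniform in [0,1), consistent with the sample output
                writer.print(values[j]);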
                writer.print(',');
            }

            int classValue = computeClass(values, numClasses);
            writer.println(classValue);
        }

        writer.close();
    }

This code does nothing more than open a file and write to it...
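To make the header layout easier to inspect, here is a minimal sketch that builds only the ARFF header into a String instead of a file. The class name `ArffHeaderSketch` and the method `buildHeader` are hypothetical names introduced for illustration; the attribute naming (A, B, C, ... for the numeric fields, then one nominal class attribute) follows genArffData.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class ArffHeaderSketch {

    // Hypothetical helper: builds only the ARFF header, using the same naming
    // scheme as genArffData (numeric attributes A, B, C, ... plus one class attribute).
    static String buildHeader(String relation, int numFields, int numClasses) {
        StringWriter buffer = new StringWriter();
        // try-with-resources closes the writer even if an exception is thrown
        try (PrintWriter writer = new PrintWriter(buffer)) {
            writer.println("@RELATION \"" + relation + "\"");
            writer.println();
            for (int i = 0; i < numFields; i++) {
                writer.println("@ATTRIBUTE " + (char) ('A' + i) + " REAL");
            }
            // Class labels are written as integers 0..numClasses-1
            StringBuilder labels = new StringBuilder();
            for (int c = 0; c < numClasses; c++) {
                if (c > 0) labels.append(',');
                labels.append(c);
            }
            writer.println("@ATTRIBUTE " + (char) ('A' + numFields) + " {" + labels + "}");
            writer.println();
            writer.println("@DATA");
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildHeader("LogisticRegression FakeData", 6, 4));
    }
}
```

Unlike the `(char)('0'+i)` trick in genArffData, appending the label as an integer keeps working for more than ten classes. Letter-based attribute names still run out after 26 fields in both versions.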

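The key part is the computeClass function, which defines the rule for assigning a class to each row. It uses a mix of functions (after all these years of using Java, this is the first time I looked at what is actually in Math... embarrassing).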

    private static int computeClass(float[] values, int numClasses) {

        float cv = values[0];
        for(int i=1; i<values.length; i++) {
            switch (i) {
            case 1:
                cv += values[i] * 5;
                break;
            case 2:
                cv += java.lang.Math.log10(values[i]);
                break;
            case 3:
                cv += java.lang.Math.asin(values[i]);
                break;
            case 4:
                cv += java.lang.Math.exp(values[i]);
                break;
            default:
                cv += values[i]*i;
                break;
            }
        }

        int c;
        if (cv<3) {
            c = 0;
        }
        else if (cv > (numClasses+3)) {
            c = numClasses-1;
        }
        else {
            c = ((int) ((cv)*1.5) / numClasses);
            if (c >= numClasses)
                c = numClasses-1;
        }
        return c; 
    }

OK, let's put it in a main function and try it out. How about 100,000 rows:

    public static void main(String[] args) throws Exception {

        final String arffFilePath = "data/LogisticRegressionFakeData.arff";
        genArffData(arffFilePath, 100000, 6, 4);


        Logistic logic = trainModel(arffFilePath, 6);

        ArffLoader loader = new ArffLoader();
        File inputFile = new File(arffFilePath); // test data file
        loader.setFile(inputFile);
        Instances insTest =loader.getDataSet(); // read the test file
        insTest.setClassIndex(6); // index of the class attribute (0-based); insTest.numAttributes() returns the total number of attributes

        double sum = insTest.numInstances(); // number of test instances
        double right=0.0f;
        for(int i=0;i<sum;i++) {
            Instance ins = insTest.instance(i);
            if(logic.classifyInstance(ins)==ins.classValue()) {
                right++; // one more correct prediction

                System.out.println("No.\t" + i + "\t" + ins.classValue() + " RIGHT");
            }
            else {
                System.out.println("No.\t" + i + "\t" + ins.classValue() + " WRONG");
            }
        }
        System.out.println("classification precision:" + (right/sum));
    }

The generated data:

@RELATION "LogisticRegression FakeData"

@ATTRIBUTE A REAL
@ATTRIBUTE B REAL
@ATTRIBUTE C REAL
@ATTRIBUTE D REAL
@ATTRIBUTE E REAL
@ATTRIBUTE F REAL
@ATTRIBUTE G {0,1,2,3}

@DATA
0.71897244,0.32674688,0.34844375,0.14773273,0.60203516,0.030885875,1
0.87727785,0.26676136,0.9318922,0.50508565,0.22496736,0.39517665,2
0.44499284,0.5905153,0.7953741,0.05966431,0.13777435,0.106003165,1
0.37487888,0.8418185,0.33143914,0.6179532,0.39359564,0.96861655,3
0.047727704,0.23949718,0.58549887,0.53503656,0.83233106,0.5622865,2
0.70024496,0.43123567,0.18669724,0.20847279,0.17981762,0.79000807,3
0.5998019,0.39879912,0.83340144,0.5890504,0.70057064,0.049901605,2
0.6422481,0.31674922,0.18628752,0.6275924,0.66154146,0.54778665,2
0.09535301,0.63388544,0.20779681,0.16196364,0.37264192,0.73777825,3
……
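As a sanity check, the labels in the listing can be recomputed from the rule. The sketch below copies the computeClass rule into a standalone class (`LabelCheckSketch` is a hypothetical name) and applies it to the first and fourth data rows above. It assumes the loop in computeClass runs over all fields, so the sixth field falls into the default branch.

```java
class LabelCheckSketch {

    // Same rule as computeClass in the post, copied here so the sketch runs standalone.
    static int computeClass(float[] values, int numClasses) {
        float cv = values[0];
        for (int i = 1; i < values.length; i++) {
            switch (i) {
            case 1:  cv += values[i] * 5;           break;
            case 2:  cv += Math.log10(values[i]);   break;
            case 3:  cv += Math.asin(values[i]);    break;
            case 4:  cv += Math.exp(values[i]);     break;
            default: cv += values[i] * i;           break;
            }
        }
        if (cv < 3) return 0;
        if (cv > numClasses + 3) return numClasses - 1;
        // Integer division, as in the original rule
        int c = ((int) (cv * 1.5)) / numClasses;
        return Math.min(c, numClasses - 1);
    }

    public static void main(String[] args) {
        // First and fourth rows of the generated data shown in the post
        float[] first  = {0.71897244f, 0.32674688f, 0.34844375f, 0.14773273f, 0.60203516f, 0.030885875f};
        float[] fourth = {0.37487888f, 0.8418185f, 0.33143914f, 0.6179532f, 0.39359564f, 0.96861655f};

        System.out.println(computeClass(first, 4));   // prints 1, matching the label in the file
        System.out.println(computeClass(fourth, 4));  // prints 3, matching the label in the file
    }
}
```

For the first row the weighted sum comes to roughly 4.02, which lands in the middle branch: (int)(4.02 * 1.5) / 4 = 6 / 4 = 1. For the fourth row the sum is about 11.1, above numClasses + 3 = 7, so it is assigned the last class.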

Result of the run (evaluated on the same file the model was trained on):

classification precision:0.9487

Looking at the value distributions in the Weka tool (they look pretty? Of course, they were tuned to look that way...):
[Figure 1: value distributions shown in Weka]

Then I let Weka run for a while...
[Figure 2: screenshot of the Weka run]

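With made-up data, the resulting model does turn out to be fairly good... What if the class-generation rule were tuned further and made simpler, without functions like log and asin: could it reach 100% accuracy?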
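One way to try that experiment is to swap in a purely linear rule. The sketch below is a hypothetical variant (`computeClassLinear` and its weights are made up for illustration, not taken from the original code) that keeps the same bucketing as computeClass but replaces the log10/asin/exp terms with a plain weighted sum. Whether the model then reaches 100% is left as the open question it is; this only shows what such a rule might look like.

```java
class LinearRuleSketch {

    // Hypothetical simplified rule: a plain weighted sum with no log10/asin/exp terms.
    // The weights (1 for field 0, i for field i >= 1) are an arbitrary choice.
    static int computeClassLinear(float[] values, int numClasses) {
        float cv = values[0];
        for (int i = 1; i < values.length; i++) {
            cv += values[i] * i;
        }
        // Same bucketing as in computeClass
        if (cv < 3) return 0;
        if (cv > numClasses + 3) return numClasses - 1;
        int c = ((int) (cv * 1.5)) / numClasses;
        return Math.min(c, numClasses - 1);
    }

    public static void main(String[] args) {
        float[] low  = {0f, 0f, 0f, 0f, 0f, 0f};   // weighted sum 0
        float[] mid  = {1f, 1f, 1f, 0f, 0f, 0f};   // weighted sum 1 + 1 + 2 = 4
        float[] high = {1f, 1f, 1f, 1f, 1f, 1f};   // weighted sum 1 + 1 + 2 + 3 + 4 + 5 = 16

        System.out.println(computeClassLinear(low, 4));   // prints 0
        System.out.println(computeClassLinear(mid, 4));   // prints 1
        System.out.println(computeClassLinear(high, 4));  // prints 3
    }
}
```

To use it, genArffData would call computeClassLinear in place of computeClass. Note that the class boundaries are still produced by thresholds and integer division, so the labels remain a step function of the weighted sum rather than a linear function of the inputs.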
