日撸 Java 三百行 (Days 51-60: kNN and NB)

Contents

  • 51-53. kNN Classifier
    • Leave-one-out Testing
  • 54-55. Recommendation Based on M-distance
    • A Brief Explanation
    • Notes on the Code
  • 56-57. kMeans Clustering
  • 58. Naive Bayes
    • Laplace Smoothing

51-53. kNN Classifier

Modified and partially commented on the basis of my supervisor's code. Extension: kNN

  1. Reimplemented computeNearests: a single scan of the training set now yields all $k$ neighbors.
  2. Added the setDistanceMeasure() method.
  3. Added the setNumNeighbors() method.
  4. Added the weightedVoting() method: the shorter the distance, the larger the say. Two or more weighting schemes are supported (see the formulas after this list).
    Weighting scheme reference: https://www.cnblogs.com/bigmonkey/p/7387943.html
  5. Implemented leave-one-out testing.
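
For reference, the two weighting schemes implemented below can be written as follows (a sketch matching the hard-coded parameters in weightedVoting(), where $d_i$ is the distance to the $i$-th neighbor; the constant 5 and the Gaussian parameters $a$, $b$, $c$ are values chosen in the code, not part of any standard):

$$w_i^{\mathrm{reverse}} = \frac{1}{d_i + 5}, \qquad w_i^{\mathrm{Gaussian}} = a\, e^{-\frac{(d_i - b)^2}{2c^2}}$$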

2021-08-23

  • Added the weighting schemes, a weighted-or-not switch, and the related variables; added the corresponding set methods.
  • Modified the predict() method.
  • Modified the computeNearests() method so that it scans the training set only once.
  • Added a batch of set methods as required.
  • Added some print statements to the getAccuracy() method.

2021-08-24

  • Added the leave-one-out testing code.

package day60;

import java.io.FileReader;
import java.util.Arrays;
import java.util.Random;

import extra.ToolForIRIS;
import weka.core.*;

/**
 * kNN classification.
 *
 * @author Fan Min [email protected].
 */
public class KnnClassification {

    /**
     * Manhattan distance.
     */
    public static final int MANHATTAN = 0;

    /**
     * Euclidean distance.
     */
    public static final int EUCLIDEAN = 1;

    /**
     * The distance measure.
     */
    public int distanceMeasure = EUCLIDEAN;

    /**
     * A random instance.
     */
    public static final Random random = new Random();

    /**
     * The number of neighbors.
     */
    int numNeighbors = 7;

    /**
     * The whole dataset.
     */
    Instances dataset;

    /**
     * The training set. Represented by the indices of the data.
     */
    int[] trainingSet;

    /**
     * The testing set. Represented by the indices of the data.
     */
    int[] testingSet;

    /**
     * The predictions.
     */
    int[] predictions;

    /**
     * 0: unweighted.
     * 1: reverse function.
     * 2: Gaussian.
     *
     * The default is 1.
     */
    public static final int WEIGHT_VOTING_SIMPLE = 0;
    public static final int WEIGHT_VOTING_REVERSE = 1;
    public static final int WEIGHT_VOTING_GAUSSIAN = 2;
    public int votingWay = WEIGHT_VOTING_REVERSE;

    /**
     * The distances to the k nearest neighbors (used for weighted voting).
     */
    double[] distances;

    /**
     * ********************
     * The first constructor.
     *
     * @param paraFilename The arff filename.
     *                     ********************
     */
    public KnnClassification(String paraFilename) {
        try {
            FileReader fileReader = new FileReader(paraFilename);
            dataset = new Instances(fileReader);
            // The last attribute is the decision class.
            dataset.setClassIndex(dataset.numAttributes() - 1);
            fileReader.close();
        } catch (Exception ee) {
            System.out.println("Error occurred while trying to read \'" + paraFilename
                    + "\' in KnnClassification constructor.\r\n" + ee);
            System.exit(0);
        } // Of try
    }// Of the first constructor

    /**
     * ********************
     * Get random indices for data randomization.
     *
     * @param paraLength The length of the sequence.
     * @return An array of indices, e.g., {4, 3, 1, 5, 0, 2} with length 6.
     * ********************
     */
    public static int[] getRandomIndices(int paraLength) { // Get a random index array.
        int[] resultIndices = new int[paraLength];

        // Step 1. Initialize.
        for (int i = 0; i < paraLength; i++) {
            resultIndices[i] = i;
        } // Of for i

        // Step 2. Randomly swap.
        int tempFirst, tempSecond, tempValue;
        for (int i = 0; i < paraLength; i++) {
            // Generate two random indices.
            tempFirst = random.nextInt(paraLength);
            tempSecond = random.nextInt(paraLength);

            // Swap.
            tempValue = resultIndices[tempFirst];
            resultIndices[tempFirst] = resultIndices[tempSecond];
            resultIndices[tempSecond] = tempValue;
        } // Of for i

        return resultIndices;
    }// Of getRandomIndices

    /**
     * ********************
     * Split the data into training and testing parts.
     *
     * @param paraTrainingFraction The fraction of the training set.
     *                             ********************
     */
    public void splitTrainingTesting(double paraTrainingFraction) { // Split the dataset by random indices.
        int tempSize = dataset.numInstances();
        int[] tempIndices = getRandomIndices(tempSize);  // The random index array.
        int tempTrainingSize = (int) (tempSize * paraTrainingFraction);

        trainingSet = new int[tempTrainingSize];
        testingSet = new int[tempSize - tempTrainingSize];

        for (int i = 0; i < tempTrainingSize; i++) {
            trainingSet[i] = tempIndices[i];
        } // Of for i

        for (int i = 0; i < tempSize - tempTrainingSize; i++) {
            testingSet[i] = tempIndices[tempTrainingSize + i];
        } // Of for i
    }// Of splitTrainingTesting

    /**
     * ********************
     * Predict for the whole testing set. The results are stored in predictions.
     *
     * @see #predictions
     * ********************
     */
    public void predict() {
        predictions = new int[testingSet.length];
        for (int i = 0; i < predictions.length; i++) {
            predictions[i] = predict(testingSet[i]);  // Store the prediction.
        } // Of for i
    }// Of predict

    /**
     * ********************
     * Predict for the given instance.
     *
     * @param paraIndex The index of the given instance.
     * @return The prediction.
     * ********************
     */
    public int predict(int paraIndex) {
        int[] tempNeighbors = computeNearests(paraIndex);
        int resultPrediction;
        if (votingWay == WEIGHT_VOTING_SIMPLE) { // Unweighted.
            resultPrediction = simpleVoting(tempNeighbors);
        } else { // Weighted.
            resultPrediction = weightedVoting(tempNeighbors);
        }// Of if

        return resultPrediction;
    }// Of predict

    /**
     * ********************
     * The distance between two instances.
     *
     * @param paraI The index of the first instance.
     * @param paraJ The index of the second instance.
     * @return The distance.
     * ********************
     */
    public double distance(int paraI, int paraJ) {
        // double, since attribute differences are fractional.
        double resultDistance = 0;
        double tempDifference;
        switch (distanceMeasure) {
            case MANHATTAN: // The 1-norm.
                for (int i = 0; i < dataset.numAttributes() - 1; i++) {
                    tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
                    if (tempDifference < 0) {
                        resultDistance -= tempDifference;
                    } else {
                        resultDistance += tempDifference;
                    } // Of if
                } // Of for i
                break;

            case EUCLIDEAN: // The 2-norm (most common); the square root is omitted.
                for (int i = 0; i < dataset.numAttributes() - 1; i++) { // Skip the class attribute.
                    tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
                    resultDistance += tempDifference * tempDifference;
                } // Of for i
                break;
            default:
                System.out.println("Unsupported distance measure: " + distanceMeasure);
        }// Of switch

        return resultDistance;
    }// Of distance

    /**
     * ********************
     * Get the accuracy of the classifier.
     *
     * @return The accuracy.
     * ********************
     */
    public double getAccuracy() {
        // A double divided by an int yields a double.
        double tempCorrect = 0;
        for (int i = 0; i < predictions.length; i++) {
            if (predictions[i] == dataset.instance(testingSet[i]).classValue()) {
                tempCorrect++;
            } // Of if
            System.out.println("The result of " + testingSet[i] + "," + (predictions[i] == dataset.instance(testingSet[i]).classValue() ? "√" : "×"));
            System.out.println("The predict is " + predictions[i] + " : " + ToolForIRIS.getName(predictions[i]));
            System.out.println("The real value is " + dataset.instance(testingSet[i]).value(4) + " :" + dataset.instance(testingSet[i]).toString(4));
            System.out.println();
        } // Of for i

        return tempCorrect / testingSet.length;
    }// Of getAccuracy

    /**
     * ***********************************
     * Compute the nearest k neighbors. Select one neighbor in each scan. In
     * fact we can scan only once. You may implement it by yourself.
     * I did. Only scan once.
     *
     * @param paraCurrent current instance. We are comparing it with all others.
     * @return the indices of the nearest instances.
     * ***********************************
     */
    public int[] computeNearests(int paraCurrent) {
        int[] resultNearests; // Stores instance indices.
        int[] resultNearAll = new int[trainingSet.length]; // Stores instance indices.
        double[] resultDistance = new double[trainingSet.length]; // Stores distance values.
        double tempDistance;
        boolean flag;
        for (int i = 0; i < trainingSet.length; i++) {
            tempDistance = distance(paraCurrent, trainingSet[i]);
            flag = false;
            // Insert into the sorted arrays.
            for (int j = 0; j <= i; j++) {
                if (tempDistance < resultDistance[j]) {
                    // Shift the larger entries backwards.
                    for (int k = i; k >= j + 1; k--) {
                        resultDistance[k] = resultDistance[k - 1];
                        resultNearAll[k] = resultNearAll[k - 1];
                    }// Of for k
                    resultDistance[j] = tempDistance;
                    resultNearAll[j] = trainingSet[i];
                    flag = true; // The shift is done; the value goes to position j.
                    break;
                }// Of if
            }// Of for j
            if (!flag) {
                // No shift: the new value is the largest, so it goes to position i.
                resultDistance[i] = tempDistance;
                resultNearAll[i] = trainingSet[i];
            }// Of if
        }// Of for i

        resultNearests = Arrays.copyOf(resultNearAll, numNeighbors);
        distances = Arrays.copyOf(resultDistance, numNeighbors); // Used for weighted voting.

        System.out.println("The nearest of " + paraCurrent + " are: " + Arrays.toString(resultNearests));
        return resultNearests;
    }// Of computeNearests

    /**
     * ***********************************
     * Voting using the instances.
     *
     * @param paraNeighbors The indices of the neighbors.
     * @return The predicted label.
     * ***********************************
     */
    public int simpleVoting(int[] paraNeighbors) {
        int[] tempVotes = new int[dataset.numClasses()];
        for (int i = 0; i < paraNeighbors.length; i++) {
            tempVotes[(int) dataset.instance(paraNeighbors[i]).classValue()]++; // One more vote for the corresponding class.
        } // Of for i

        int tempMaximalVotingIndex = 0;
        int tempMaximalVoting = 0;
        for (int i = 0; i < dataset.numClasses(); i++) {
            if (tempVotes[i] > tempMaximalVoting) {
                tempMaximalVoting = tempVotes[i];
                tempMaximalVotingIndex = i;
            } // Of if
        } // Of for i

        return tempMaximalVotingIndex;
    }// Of simpleVoting

    /**
     * Set the distance measure.
     *
     * @param paraMeasure 0: Manhattan distance, 1: Euclidean distance.
     */
    public void setDistanceMeasure(int paraMeasure) {
        if (paraMeasure == 0) {
            this.distanceMeasure = MANHATTAN;
        } else if (paraMeasure == 1) {
            this.distanceMeasure = EUCLIDEAN;
        } else {
            System.out.println("Wrong measure");
        }// Of if
    }// Of setDistanceMeasure

    /**
     * Set k.
     *
     * @param paraNeighborsNum k
     */
    public void setNumNeighbors(int paraNeighborsNum) {
        if (paraNeighborsNum > 0) {
            this.numNeighbors = paraNeighborsNum;
        } else {
            System.out.println("Wrong K-value");
        }// Of if
    }// Of setNumNeighbors

    /**
     * Weighted voting.
     *
     * @param paraNeighbors The indices of the k nearest neighbors.
     * @return The voting result.
     */
    public int weightedVoting(int[] paraNeighbors) {
        int tempMaximalVotingIndex = 0;
        double weight;
        double[] tempVotes = new double[dataset.numClasses()];
        switch (votingWay) {
            case WEIGHT_VOTING_REVERSE: {
                // weight = 1 / (distance + const)
                for (int i = 0; i < paraNeighbors.length; i++) {
                    weight = 1 / (distances[i] + 5);
                    tempVotes[(int) dataset.instance(paraNeighbors[i]).classValue()] += weight; // Add the weight to the corresponding class.
                } // Of for i
                break; // Without the break this case would fall through to the Gaussian one.
            }// Of case WEIGHT_VOTING_REVERSE
            case WEIGHT_VOTING_GAUSSIAN: {
                // a is the height of the curve, b is the offset of its center on
                // the x-axis, and c is the half-peak width.
                double a = 1.0;
                double b = 0.0;
                double c = 10.0; // Needs tuning.
                for (int i = 0; i < paraNeighbors.length; i++) {
                    weight = a * Math.pow(Math.E, -((Math.pow((distances[i] - b), 2)) / (2 * c * c)));
                    tempVotes[(int) dataset.instance(paraNeighbors[i]).classValue()] += weight; // Add the weight to the corresponding class.
                } // Of for i
                break;
            }// Of case WEIGHT_VOTING_GAUSSIAN
            default:
        }// Of switch

        double tempMaximalVoting = 0.0;
        for (int i = 0; i < dataset.numClasses(); i++) {
            if (tempVotes[i] > tempMaximalVoting) {
                tempMaximalVoting = tempVotes[i];
                tempMaximalVotingIndex = i;
            } // Of if
        } // Of for i

        return tempMaximalVotingIndex;
    }// Of weightedVoting

    /**
     * Set the voting way.
     * 0: unweighted
     * 1: reverse function
     * 2: Gaussian
     *
     * @param paraWay votingWay
     */
    public void setWeightWay(int paraWay) {
        if (paraWay >= 0 && paraWay <= 2) {
            this.votingWay = paraWay;
        } else {
            System.out.println("Wrong parameter");
        }// Of if
    }// Of setWeightWay

    /**
     * ********************
     * The entrance of the program.
     *
     * @param args Not used now.
     * ********************
     */
    public static void main(String args[]) {
        KnnClassification tempClassifier = new KnnClassification("E:\\Master\\Day90\\src\\files\\iris.arff");
        tempClassifier.setWeightWay(WEIGHT_VOTING_GAUSSIAN);
        tempClassifier.splitTrainingTesting(0.8); // 80% as the training set.
        tempClassifier.predict();
        System.out.println("The accuracy of the classifier is: " + tempClassifier.getAccuracy());
    }// Of main
}// Of class KnnClassification

Output of one run:

The nearest of 8 are: [24, 2, 3, 39, 7, 11, 13]
The nearest of 106 are: [113, 67, 142, 121, 64, 92, 95]
The nearest of 65 are: [50, 83, 74, 87, 145, 52, 149]
The nearest of 122 are: [105, 107, 125, 129, 130, 118, 102]
The nearest of 124 are: [105, 107, 125, 116, 140, 83, 100]
The nearest of 12 are: [24, 2, 10, 3, 39, 7, 11]
The nearest of 43 are: [24, 2, 10, 3, 39, 7, 11]
The nearest of 27 are: [24, 2, 10, 3, 39, 7, 11]
The nearest of 108 are: [105, 107, 125, 116, 140, 83, 100]
The nearest of 42 are: [24, 2, 3, 39, 7, 11, 13]
The nearest of 90 are: [113, 83, 74, 87, 149, 79, 67]
The nearest of 44 are: [24, 2, 10, 3, 39, 7, 11]
The nearest of 137 are: [107, 125, 113, 50, 116, 140, 83]
The nearest of 18 are: [24, 10, 39, 7, 11, 49, 30]
The nearest of 31 are: [24, 2, 10, 3, 39, 7, 11]
The nearest of 77 are: [50, 116, 140, 83, 74, 87, 145]
The nearest of 21 are: [24, 2, 10, 3, 39, 7, 11]
The nearest of 5 are: [24, 2, 10, 3, 39, 7, 11]
The nearest of 0 are: [24, 2, 10, 3, 39, 7, 11]
The nearest of 146 are: [113, 50, 116, 140, 83, 74, 87]
The nearest of 53 are: [74, 87, 79, 67, 51, 126, 72]
The nearest of 127 are: [113, 50, 116, 140, 83, 74, 87]
The nearest of 109 are: [105, 107, 125, 131, 116, 140, 117]
The nearest of 143 are: [105, 107, 125, 116, 140, 83, 117]
The nearest of 115 are: [125, 113, 50, 116, 140, 83, 87]
The nearest of 1 are: [24, 2, 10, 3, 39, 7, 11]
The nearest of 139 are: [107, 125, 50, 116, 140, 83, 100]
The nearest of 71 are: [50, 74, 87, 52, 79, 76, 67]
The nearest of 148 are: [113, 50, 116, 140, 83, 100, 145]
The nearest of 62 are: [74, 87, 52, 79, 76, 67, 126]
The result of 8,√
The predict is 0 : Iris-setosa
The real value is 0.0 :Iris-setosa

The result of 106,×
The predict is 1 : Iris-versicolor
The real value is 2.0 :Iris-virginica

The result of 65,√
The predict is 1 : Iris-versicolor
The real value is 1.0 :Iris-versicolor

The result of 122,√
The predict is 2 : Iris-virginica
The real value is 2.0 :Iris-virginica

The result of 124,√
The predict is 2 : Iris-virginica
The real value is 2.0 :Iris-virginica

The result of 12,√
The predict is 0 : Iris-setosa
The real value is 0.0 :Iris-setosa

The result of 43,√
The predict is 0 : Iris-setosa
The real value is 0.0 :Iris-setosa

The result of 27,√
The predict is 0 : Iris-setosa
The real value is 0.0 :Iris-setosa

The result of 108,√
The predict is 2 : Iris-virginica
The real value is 2.0 :Iris-virginica

The result of 42,√
The predict is 0 : Iris-setosa
The real value is 0.0 :Iris-setosa

The result of 90,√
The predict is 1 : Iris-versicolor
The real value is 1.0 :Iris-versicolor

The result of 44,√
The predict is 0 : Iris-setosa
The real value is 0.0 :Iris-setosa

The result of 137,√
The predict is 2 : Iris-virginica
The real value is 2.0 :Iris-virginica

The result of 18,√
The predict is 0 : Iris-setosa
The real value is 0.0 :Iris-setosa

The result of 31,√
The predict is 0 : Iris-setosa
The real value is 0.0 :Iris-setosa

The result of 77,√
The predict is 1 : Iris-versicolor
The real value is 1.0 :Iris-versicolor

The result of 21,√
The predict is 0 : Iris-setosa
The real value is 0.0 :Iris-setosa

The result of 5,√
The predict is 0 : Iris-setosa
The real value is 0.0 :Iris-setosa

The result of 0,√
The predict is 0 : Iris-setosa
The real value is 0.0 :Iris-setosa

The result of 146,×
The predict is 1 : Iris-versicolor
The real value is 2.0 :Iris-virginica

The result of 53,√
The predict is 1 : Iris-versicolor
The real value is 1.0 :Iris-versicolor

The result of 127,×
The predict is 1 : Iris-versicolor
The real value is 2.0 :Iris-virginica

The result of 109,√
The predict is 2 : Iris-virginica
The real value is 2.0 :Iris-virginica

The result of 143,√
The predict is 2 : Iris-virginica
The real value is 2.0 :Iris-virginica

The result of 115,√
The predict is 2 : Iris-virginica
The real value is 2.0 :Iris-virginica

The result of 1,√
The predict is 0 : Iris-setosa
The real value is 0.0 :Iris-setosa

The result of 139,√
The predict is 2 : Iris-virginica
The real value is 2.0 :Iris-virginica

The result of 71,√
The predict is 1 : Iris-versicolor
The real value is 1.0 :Iris-versicolor

The result of 148,√
The predict is 2 : Iris-virginica
The real value is 2.0 :Iris-virginica

The result of 62,√
The predict is 1 : Iris-versicolor
The real value is 1.0 :Iris-versicolor

The accuracy of the classifier is: 0.9

Leave-one-out Testing

Two methods are added.
[Question] With leave-one-out testing there is no random split of the dataset, so will the accuracy be identical on every run? (Presumably yes: nothing in the procedure is random, so the result is deterministic.)

	/**
     * Leave-one-out split.
     *
     * @param paraTestingIndex The index of the testing instance.
     */
    public void splitTrainingTestingOne(int paraTestingIndex){
        int tempSize = dataset.numInstances();

        trainingSet = new int[tempSize - 1];
        testingSet = new int[1];

        testingSet[0] = paraTestingIndex;
        int tempIndex = 0;
        for (int i = 0; i < tempSize; i++) {
            if (i != paraTestingIndex) {
                trainingSet[tempIndex++] = i;
            }// Of if
        }// Of for i
    }// Of splitTrainingTestingOne

	/**
     * Leave-one-out testing.
     *
     * A unit test [to be decoupled and optimized].
     */
    public static void leaveOneOutTest(){
        KnnClassification tempClassifier = new KnnClassification("E:\\Master\\Day90\\src\\files\\iris.arff");
        double tempCount = 0;
        // Step 1. Split: each instance serves as the testing set exactly once.
        for (int i = 0; i < tempClassifier.dataset.numInstances(); i++) {
            tempClassifier.splitTrainingTestingOne(i); // Split: the current instance i is the testing set.
            // Step 2. Predict.
            tempClassifier.predict();
            // Step 3. Check the prediction.
            if (tempClassifier.predictions[0] == tempClassifier.dataset.instance(i).classValue()) {
                tempCount++;
            }// Of if
        }// Of for i
        double tempRes = tempCount / tempClassifier.dataset.numInstances();
        System.out.println("The accuracy of the classifier is " + tempRes);
    }// Of leaveOneOutTest


	public static void main(String args[]) {
        leaveOneOutTest();
    }// Of main

Test output:

The nearest of 0 are: [17, 4, 39, 27, 28, 40, 7]
The nearest of 1 are: [45, 12, 9, 34, 37, 25, 30]
The nearest of 2 are: [47, 3, 6, 12, 45, 42, 29]
The nearest of 3 are: [47, 29, 30, 2, 45, 12, 38]
The nearest of 4 are: [0, 17, 40, 7, 39, 27, 19]
The nearest of 5 are: [18, 10, 48, 44, 19, 46, 16]
The nearest of 6 are: [47, 2, 11, 42, 29, 3, 30]
The nearest of 7 are: [39, 49, 0, 17, 26, 4, 11]
The nearest of 8 are: [38, 3, 42, 13, 47, 12, 45]
The nearest of 9 are: [34, 37, 1, 30, 12, 25, 49]
The nearest of 10 are: [48, 27, 36, 19, 46, 5, 16]
The nearest of 11 are: [29, 7, 26, 24, 30, 6, 49]
The nearest of 12 are: [1, 9, 34, 37, 45, 30, 2]
The nearest of 13 are: [38, 42, 8, 47, 2, 3, 12]
The nearest of 14 are: [33, 16, 15, 18, 10, 36, 48]
The nearest of 15 are: [33, 14, 5, 16, 18, 32, 10]
The nearest of 16 are: [10, 48, 33, 19, 5, 21, 36]
The nearest of 17 are: [0, 40, 4, 39, 27, 28, 7]
The nearest of 18 are: [5, 10, 48, 20, 16, 31, 36]
The nearest of 19 are: [21, 46, 48, 4, 17, 0, 27]
The nearest of 20 are: [31, 27, 28, 10, 39, 48, 36]
The nearest of 21 are: [19, 46, 17, 4, 48, 0, 27]
The nearest of 22 are: [6, 2, 40, 42, 47, 4, 35]
The nearest of 23 are: [26, 43, 39, 7, 31, 17, 27]
The nearest of 24 are: [11, 29, 26, 30, 7, 23, 39]
The nearest of 25 are: [9, 34, 37, 1, 30, 12, 45]
The nearest of 26 are: [23, 43, 7, 39, 17, 11, 49]
The nearest of 27 are: [28, 0, 39, 17, 48, 7, 4]
The nearest of 28 are: [27, 0, 39, 17, 7, 49, 40]
The nearest of 29 are: [30, 3, 11, 47, 9, 34, 37]
The nearest of 30 are: [29, 9, 34, 37, 3, 25, 45]
The nearest of 31 are: [20, 27, 28, 36, 17, 10, 39]
The nearest of 32 are: [33, 46, 19, 48, 10, 5, 16]
The nearest of 33 are: [32, 15, 16, 14, 5, 10, 48]
The nearest of 34 are: [9, 37, 1, 30, 12, 25, 49]
The nearest of 35 are: [49, 1, 2, 40, 28, 9, 34]
The nearest of 36 are: [10, 31, 28, 48, 27, 0, 20]
The nearest of 37 are: [9, 34, 1, 30, 12, 25, 49]
The nearest of 38 are: [8, 42, 13, 3, 47, 2, 45]
The nearest of 39 are: [7, 0, 27, 28, 49, 17, 26]
The nearest of 40 are: [17, 0, 4, 7, 49, 39, 28]
The nearest of 41 are: [8, 38, 45, 13, 12, 1, 3]
The nearest of 42 are: [38, 47, 3, 2, 6, 8, 13]
The nearest of 43 are: [26, 23, 21, 17, 7, 40, 39]
The nearest of 44 are: [46, 5, 21, 19, 43, 48, 26]
The nearest of 45 are: [1, 12, 30, 2, 3, 9, 34]
The nearest of 46 are: [19, 21, 48, 4, 27, 10, 32]
The nearest of 47 are: [3, 2, 42, 6, 29, 38, 12]
The nearest of 48 are: [10, 27, 19, 46, 21, 0, 17]
The nearest of 49 are: [7, 39, 35, 0, 28, 17, 40]
The nearest of 50 are: [52, 86, 65, 76, 58, 75, 77]
The nearest of 51 are: [56, 75, 65, 86, 91, 74, 58]
The nearest of 52 are: [50, 86, 77, 76, 58, 65, 54]
The nearest of 53 are: [89, 80, 69, 81, 92, 94, 90]
The nearest of 54 are: [58, 75, 76, 86, 74, 65, 51]
The nearest of 55 are: [66, 90, 96, 94, 78, 95, 99]
The nearest of 56 are: [51, 85, 91, 127, 86, 70, 138]
The nearest of 57 are: [93, 98, 60, 81, 80, 59, 79]
The nearest of 58 are: [75, 54, 65, 76, 86, 74, 51]
The nearest of 59 are: [89, 94, 53, 80, 69, 64, 88]
The nearest of 60 are: [93, 57, 81, 80, 98, 53, 69]
The nearest of 61 are: [96, 78, 95, 99, 88, 97, 71]
The nearest of 62 are: [92, 69, 67, 80, 82, 53, 87]
The nearest of 63 are: [91, 73, 78, 97, 138, 126, 54]
The nearest of 64 are: [82, 79, 88, 99, 69, 59, 92]
The nearest of 65 are: [75, 58, 86, 51, 74, 54, 50]
The nearest of 66 are: [84, 55, 96, 78, 61, 95, 88]
The nearest of 67 are: [92, 82, 99, 69, 94, 95, 96]
The nearest of 68 are: [87, 72, 119, 54, 73, 146, 123]
The nearest of 69 are: [80, 89, 81, 92, 82, 53, 67]
The nearest of 70 are: [138, 127, 149, 85, 56, 126, 91]
The nearest of 71 are: [97, 82, 92, 61, 99, 74, 67]
The nearest of 72 are: [133, 123, 146, 83, 119, 126, 54]
The nearest of 73 are: [63, 91, 78, 97, 55, 72, 54]
The nearest of 74 are: [97, 75, 58, 54, 65, 51, 71]
The nearest of 75 are: [65, 58, 74, 51, 54, 86, 97]
The nearest of 76 are: [58, 86, 52, 54, 77, 50, 75]
The nearest of 77 are: [52, 86, 147, 76, 110, 133, 123]
The nearest of 78 are: [91, 63, 61, 97, 55, 73, 66]
The nearest of 79 are: [81, 80, 69, 64, 82, 92, 67]
The nearest of 80 are: [81, 69, 53, 89, 92, 79, 82]
The nearest of 81 are: [80, 69, 79, 53, 89, 92, 82]
The nearest of 82 are: [92, 99, 67, 69, 71, 94, 89]
The nearest of 83 are: [133, 101, 142, 149, 123, 127, 72]
The nearest of 84 are: [66, 55, 96, 88, 94, 90, 95]
The nearest of 85 are: [56, 70, 51, 91, 78, 61, 138]
The nearest of 86 are: [52, 65, 58, 50, 75, 76, 77]
The nearest of 87 are: [68, 72, 62, 54, 97, 74, 73]
The nearest of 88 are: [95, 96, 99, 94, 61, 82, 66]
The nearest of 89 are: [53, 69, 80, 94, 92, 99, 59]
The nearest of 90 are: [94, 55, 96, 89, 99, 67, 95]
The nearest of 91 are: [63, 78, 73, 97, 51, 56, 74]
The nearest of 92 are: [82, 67, 99, 69, 94, 89, 71]
The nearest of 93 are: [57, 60, 98, 81, 80, 79, 59]
The nearest of 94 are: [99, 96, 90, 89, 88, 92, 55]
The nearest of 95 are: [96, 88, 99, 94, 61, 55, 67]
The nearest of 96 are: [95, 99, 88, 94, 61, 55, 92]
The nearest of 97 are: [74, 71, 91, 78, 61, 63, 75]
The nearest of 98 are: [57, 93, 60, 79, 81, 80, 64]
The nearest of 99 are: [96, 94, 88, 95, 82, 92, 67]
The nearest of 100 are: [136, 144, 104, 143, 140, 124, 148]
The nearest of 101 are: [142, 113, 121, 149, 83, 127, 138]
The nearest of 102 are: [125, 120, 143, 130, 112, 129, 124]
The nearest of 103 are: [116, 137, 128, 111, 132, 147, 104]
The nearest of 104 are: [132, 128, 140, 124, 143, 112, 120]
The nearest of 105 are: [122, 107, 135, 118, 130, 125, 117]
The nearest of 106 are: [84, 59, 90, 89, 94, 66, 53]
The nearest of 107 are: [130, 125, 105, 102, 129, 122, 135]
The nearest of 108 are: [128, 103, 116, 132, 111, 112, 104]
The nearest of 109 are: [143, 120, 144, 102, 135, 124, 125]
The nearest of 110 are: [147, 115, 77, 145, 137, 116, 141]
The nearest of 111 are: [147, 128, 146, 103, 116, 123, 132]
The nearest of 112 are: [139, 140, 120, 145, 124, 116, 104]
The nearest of 113 are: [101, 142, 121, 114, 83, 149, 146]
The nearest of 114 are: [121, 101, 142, 113, 149, 127, 138]
The nearest of 115 are: [148, 110, 145, 147, 136, 140, 132]
The nearest of 116 are: [137, 103, 147, 111, 128, 112, 132]
The nearest of 117 are: [131, 105, 109, 135, 122, 125, 107]
The nearest of 118 are: [122, 105, 135, 107, 130, 117, 102]
The nearest of 119 are: [72, 83, 68, 146, 113, 123, 133]
The nearest of 120 are: [143, 140, 124, 144, 112, 139, 102]
The nearest of 121 are: [101, 142, 113, 149, 114, 138, 70]
The nearest of 122 are: [105, 118, 107, 130, 135, 125, 117]
The nearest of 123 are: [126, 146, 127, 72, 133, 83, 111]
The nearest of 124 are: [120, 143, 112, 140, 104, 144, 139]
The nearest of 125 are: [129, 102, 107, 130, 143, 120, 124]
The nearest of 126 are: [123, 127, 138, 146, 83, 133, 63]
The nearest of 127 are: [138, 126, 149, 70, 123, 83, 133]
The nearest of 128 are: [132, 104, 103, 111, 116, 137, 112]
The nearest of 129 are: [125, 130, 102, 107, 112, 139, 108]
The nearest of 130 are: [107, 102, 125, 129, 135, 105, 122]
The nearest of 131 are: [117, 105, 135, 109, 125, 122, 107]
The nearest of 132 are: [128, 104, 103, 111, 112, 140, 116]
The nearest of 133 are: [83, 72, 123, 126, 127, 63, 111]
The nearest of 134 are: [103, 83, 133, 111, 116, 137, 119]
The nearest of 135 are: [130, 105, 102, 107, 122, 125, 109]
The nearest of 136 are: [148, 115, 100, 144, 140, 124, 104]
The nearest of 137 are: [116, 103, 147, 128, 111, 110, 112]
The nearest of 138 are: [127, 70, 126, 149, 123, 78, 63]
The nearest of 139 are: [112, 145, 141, 120, 140, 124, 147]
The nearest of 140 are: [144, 120, 112, 143, 104, 124, 139]
The nearest of 141 are: [145, 139, 112, 110, 147, 115, 140]
The nearest of 142 are: [101, 113, 121, 149, 83, 127, 138]
The nearest of 143 are: [120, 124, 144, 140, 104, 102, 112]
The nearest of 144 are: [140, 120, 143, 124, 136, 104, 100]
The nearest of 145 are: [141, 147, 139, 112, 115, 140, 110]
The nearest of 146 are: [123, 111, 126, 72, 83, 133, 101]
The nearest of 147 are: [110, 111, 116, 145, 115, 137, 77]
The nearest of 148 are: [136, 115, 110, 147, 140, 137, 124]
The nearest of 149 are: [127, 138, 101, 142, 70, 83, 121]
The accuracy of the classifier is 0.9666666666666667

54-55. Recommendation Based on M-distance

A Brief Explanation

The so-called M-distance computes the distance between two users (or items) from their average ratings.
The mathematical expressions:

  1. The neighbor item set of item $j$ with respect to user $i$ is
    $$N_{ij} = \{1 \le j' \le m \mid j' \neq j,\ p_{ij'} \neq 0,\ |\overline{r_{\cdot j}} - \overline{r_{\cdot j'}}| < \epsilon\}$$
    where (1) $j' \neq j$: item $j$ itself is excluded; (2) $p_{ij'} \neq 0$: user $i$ has rated item $j'$; (3) $|\overline{r_{\cdot j}} - \overline{r_{\cdot j'}}| < \epsilon$: the average ratings of items $j$ and $j'$ differ by less than $\epsilon$.

  2. The predicted rating of user $i$ for item $j$ is
    $$p_{ij} = \frac{\sum_{j' \in N_{ij}} r_{ij'}}{|N_{ij}|}$$
    i.e., sum user $i$'s ratings over the items in the neighbor set of item $j$, then divide by the size of that set: the mean.

How are neighbors defined? It has nothing to do with $k$: a neighbor is any item whose distance is less than $\epsilon$.
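
A minimal sketch of the prediction step under these definitions, assuming a dense rating matrix with 0 standing for "not rated" (the method and parameter names here are illustrative, not the author's):

    // Predict user i's rating for item j via M-distance neighbors.
    // paraRatings[user][item] holds ratings; 0 means "not rated".
    // paraItemAverages[item] holds each item's average rating.
    public static double predictMDistance(double[][] paraRatings, double[] paraItemAverages,
            int paraUser, int paraItem, double paraEpsilon) {
        double tempSum = 0;
        int tempCount = 0;
        for (int j = 0; j < paraRatings[paraUser].length; j++) {
            if (j == paraItem || paraRatings[paraUser][j] == 0) {
                continue; // Exclude the item itself and unrated items.
            } // Of if
            if (Math.abs(paraItemAverages[paraItem] - paraItemAverages[j]) < paraEpsilon) {
                tempSum += paraRatings[paraUser][j]; // Item j is a neighbor of item paraItem.
                tempCount++;
            } // Of if
        } // Of for j
        // Fall back to the item's average rating when no neighbor qualifies.
        return (tempCount == 0) ? paraItemAverages[paraItem] : tempSum / tempCount;
    }// Of predictMDistance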

Notes on the Code

MAE: mean absolute error.
RMSE: root mean squared error.
These are two of the most common metrics for measuring prediction accuracy, and two important yardsticks for evaluating models in machine learning.
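
For reference, a minimal sketch of the two metrics over paired predictions and ground-truth values (hypothetical helper methods, not part of the day 54-55 code):

    // MAE: the mean of the absolute errors.
    public static double mae(double[] paraPredictions, double[] paraTruths) {
        double tempSum = 0;
        for (int i = 0; i < paraPredictions.length; i++) {
            tempSum += Math.abs(paraPredictions[i] - paraTruths[i]);
        } // Of for i
        return tempSum / paraPredictions.length;
    }// Of mae

    // RMSE: the square root of the mean of the squared errors.
    public static double rmse(double[] paraPredictions, double[] paraTruths) {
        double tempSum = 0;
        for (int i = 0; i < paraPredictions.length; i++) {
            double tempDifference = paraPredictions[i] - paraTruths[i];
            tempSum += tempDifference * tempDifference;
        } // Of for i
        return Math.sqrt(tempSum / paraPredictions.length);
    }// Of rmse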

56-57. kMeans Clustering

After obtaining the virtual centers, replace each one with the nearest actual point as the real center, then cluster again.

            // Day 57: virtual center -> the nearest actual point.
            for (int i = 0; i < numClusters; i++) {
                double tempMinDistance = Double.MAX_VALUE;
                int tempMinPos = -1;
                double tempNewDistance;
                for (int j = 0; j < dataset.numInstances(); j++) {
                    // Compute distances within the cluster and track the minimum.
                    if (tempClusterArray[j] == i) {
                        tempNewDistance = distance(j, tempNewCenters[i]);
                        if (tempNewDistance < tempMinDistance) {
                            tempMinDistance = tempNewDistance;
                            tempMinPos = j;
                        }// Of if
                    }// Of if
                }// Of for j
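
The fragment above ends once the nearest point is found; a hedged completion of the replacement step, assuming tempNewCenters is a double[][] of virtual centers as in the surrounding code:

                // Replace the virtual center with the nearest actual instance
                // (assumption: the cluster is non-empty, so tempMinPos != -1).
                // toDoubleArray() copies all attribute values; adjust if the
                // class attribute must be excluded.
                if (tempMinPos != -1) {
                    tempNewCenters[i] = dataset.instance(tempMinPos).toDoubleArray();
                }// Of if
            }// Of for i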

58. Naive Bayes

Laplace Smoothing

Smoothing exists to solve the zero-probability problem.
The method: add the number of classes to the denominator and add one to the numerator.
Example: a classification task has 3 classes $C_1, C_2, C_3$, and the observed counts of some item $I_1$ are 0, 990, and 10 respectively (1000 observations in total). The raw probabilities of $I_1$ in the three classes are then 0, 0.99, and 0.01.

After Laplace smoothing:
the three class probabilities become 1/1003 ≈ 0.001, 991/1003 ≈ 0.988, and 11/1003 ≈ 0.011.
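
A minimal sketch of this computation (the method name is illustrative; the counts are those of the example):

    // Laplace smoothing: (count + 1) / (total + number of classes).
    public static double[] laplaceSmooth(int[] paraCounts) {
        int tempTotal = 0;
        for (int tempCount : paraCounts) {
            tempTotal += tempCount;
        } // Of for tempCount
        double[] resultProbabilities = new double[paraCounts.length];
        for (int i = 0; i < paraCounts.length; i++) {
            resultProbabilities[i] = (paraCounts[i] + 1.0) / (tempTotal + paraCounts.length);
        } // Of for i
        return resultProbabilities;
    }// Of laplaceSmooth

    // laplaceSmooth(new int[] {0, 990, 10}) yields {1/1003, 991/1003, 11/1003}.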
