Notes on the details of 《Spark MLlib 机器学习》 (Spark MLlib Machine Learning), continuously updated

1. P220

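This note explains the following passage:

After obtaining the maximum number of bins, compute the maximum number of splits. For an unordered feature, splits = number of bins / 2; for an ordered feature, splits = number of bins - 1.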

 

A reader asked where "splits = number of bins / 2" comes from for unordered features. The explanation is as follows:

 

1) First, compute numBins:

        // If the feature's number of categories is at most maxCategoriesForUnorderedFeature, treat it as unordered
        if (numCategories <= maxCategoriesForUnorderedFeature) { // unordered
          unorderedFeatures.add(featureIndex)
          numBins(featureIndex) = numUnorderedBins(numCategories)
        } else { // ordered
          numBins(featureIndex) = numCategories
        }

From the code above, for an unordered feature numBins = numUnorderedBins(numCategories).

The numUnorderedBins function is as follows:

  /**
   * Given the arity of a categorical feature (arity = number of categories),
   * return the number of bins for the feature if it is to be treated as an unordered feature.
   * There is 1 split for every partitioning of categories into 2 disjoint, non-empty sets;
   * there are math.pow(2, arity - 1) - 1 such splits.
   * Each split has 2 corresponding bins.
   * Note: each split produces 2 bins, like cutting a watermelon: one cut yields two pieces.
   */
  def numUnorderedBins(arity: Int): Int = 2 * ((1 << arity - 1) - 1)
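The count math.pow(2, arity - 1) - 1 in the comment can be checked by brute force. The sketch below is not Spark code: `UnorderedSplitCount` and `enumerateSplits` are hypothetical names introduced only for illustration. It pins category 0 to the left set so that a partition and its mirror image are counted once, and drops the case where the right set would be empty.

```scala
// Illustrative sketch, not part of Spark MLlib.
object UnorderedSplitCount {
  /** All ways to split categories 0 until arity into (left, right), both non-empty.
    * Category 0 is always placed in the left set, so {A | B} and {B | A}
    * are counted as one split. */
  def enumerateSplits(arity: Int): Seq[(Set[Int], Set[Int])] = {
    val all = (0 until arity).toSet
    val rest = (1 until arity).toSet
    rest.subsets().toSeq
      .map(subset => subset + 0)        // left set always contains category 0
      .filter(left => left != all)      // right set must not be empty
      .map(left => (left, all -- left))
  }

  def main(args: Array[String]): Unit = {
    for (arity <- 2 to 5) {
      val counted = enumerateSplits(arity).size
      val formula = (1 << (arity - 1)) - 1
      println(s"arity=$arity enumerated=$counted formula=$formula")
    }
  }
}
```

There are 2^(arity - 1) choices for the left set once category 0 is fixed, and exactly one of them (all categories on the left) is invalid, which gives 2^(arity - 1) - 1 splits.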

 

From this function: numBins = 2 * (math.pow(2, arity - 1) - 1)

 

2) Then compute numSplits from numBins:

 

  def numSplits(featureIndex: Int): Int = if (isUnordered(featureIndex)) {
    numBins(featureIndex) >> 1
  } else {
    numBins(featureIndex) - 1
  }

 

So for an unordered feature: numSplits = numBins / 2 = math.pow(2, arity - 1) - 1
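To make the relationship concrete, here is a minimal standalone sketch that evaluates both formulas for a few arities. It reuses the numUnorderedBins expression quoted above; the `BinSplitCheck` object and the two-argument `numSplits` (which takes a bin count and an explicit flag instead of a feature index) are simplifications assumed for this example, not the Spark API.

```scala
// Illustrative sketch, not part of Spark MLlib.
object BinSplitCheck {
  // Same expression as the numUnorderedBins function quoted above.
  def numUnorderedBins(arity: Int): Int = 2 * ((1 << arity - 1) - 1)

  // Simplified stand-in for numSplits: the unordered flag is passed in directly.
  def numSplits(numBins: Int, isUnordered: Boolean): Int =
    if (isUnordered) numBins >> 1 else numBins - 1

  def main(args: Array[String]): Unit = {
    for (arity <- 2 to 5) {
      val unorderedBins = numUnorderedBins(arity)
      val unorderedSplits = numSplits(unorderedBins, isUnordered = true)
      // For an ordered feature, numBins equals the number of categories.
      val orderedSplits = numSplits(arity, isUnordered = false)
      println(s"arity=$arity unordered: bins=$unorderedBins splits=$unorderedSplits; " +
        s"ordered: bins=$arity splits=$orderedSplits")
    }
  }
}
```

For example, with arity = 3 an unordered feature has 6 bins and 3 splits, while an ordered feature has 3 bins and 2 splits.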
