What is a decision tree
Trees in real life
Trees in data structures
Trees in machine learning
The key step in building a decision tree is splitting on an attribute.
Splitting on an attribute means partitioning the data at a node into branches according to the values of some feature, aiming to make each resulting subset as "pure" as possible, that is, to have the items in each subset belong to the same class as far as possible.
The different ways of measuring "purity" give rise to the ID3, C4.5, and CART algorithms.
Rule-based tree construction
Model-based tree construction
Summary: the three elements of building a decision tree
Feature selection
Entropy: in physics, a quantity measuring the uncertainty of an energy distribution.
Information entropy (Shannon entropy): measures the uncertainty of information; it represents the complexity of a random variable.
H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)
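The entropy formula can be checked directly in code. Below is a minimal standalone Scala sketch (the object and function names are my own, unrelated to the Spark code later in this post):

```scala
object EntropyDemo {
  // Shannon entropy (in bits) of a discrete probability distribution.
  // Zero-probability terms are skipped, since p * log p -> 0 as p -> 0.
  def entropy(probs: Seq[Double]): Double =
    -probs.filter(_ > 0).map(p => p * math.log(p) / math.log(2)).sum

  def main(args: Array[String]): Unit = {
    println(entropy(Seq(0.5, 0.5)))               // fair coin: 1 bit, maximum uncertainty
    println(entropy(Seq(0.25, 0.25, 0.25, 0.25))) // uniform over 4 outcomes: 2 bits
  }
}
```

A more skewed distribution gives a lower value, which is exactly why entropy works as a purity measure: a pure node has entropy 0.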
Conditional entropy
H(Y|X) = \sum_{x \in X} P(x) H(Y|X=x) = -\sum_{x \in X} P(x) \sum_{y \in Y} P(y|x) \log P(y|x)
Information gain: Gain(D,A) = H(D) - H(D|A), i.e. how much the entropy of dataset D drops after splitting on feature A.
Information gain ratio
Selecting attributes by gain ratio overcomes information gain's bias toward attributes with many distinct values.
GainRate(D,A) = \frac{Gain(D,A)}{H(A)}
where D is the dataset, A is a feature, and H(A) is the entropy of A itself.
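To make these definitions concrete, here is a standalone Scala sketch that computes Gain(D,A) and GainRate(D,A) from a toy dataset of (feature value, class label) rows; all names are illustrative:

```scala
object GainRatioDemo {
  def entropy(probs: Seq[Double]): Double =
    -probs.filter(_ > 0).map(p => p * math.log(p) / math.log(2)).sum

  // Entropy of the class-label distribution of a subset.
  def labelEntropy(labels: Seq[String]): Double = {
    val n = labels.size.toDouble
    entropy(labels.groupBy(identity).values.map(_.size / n).toSeq)
  }

  // rows: (value of feature A, class label). Returns (Gain, GainRate).
  def gainAndRatio(rows: Seq[(String, String)]): (Double, Double) = {
    val n = rows.size.toDouble
    val hD = labelEntropy(rows.map(_._2))                                   // H(D)
    val parts = rows.groupBy(_._1).values.toSeq                             // partition by A's values
    val hDA = parts.map(p => (p.size / n) * labelEntropy(p.map(_._2))).sum  // H(D|A)
    val gain = hD - hDA                                                     // Gain(D,A)
    val hA = entropy(parts.map(_.size / n))                                 // H(A), A's own entropy
    (gain, gain / hA)
  }

  def main(args: Array[String]): Unit = {
    // A feature that splits 4 rows into two pure halves: gain 1.0, ratio 1.0
    println(gainAndRatio(Seq(("a", "yes"), ("a", "yes"), ("b", "no"), ("b", "no"))))
  }
}
```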
References
ID3 and C4.5 algorithms
ID3
C4.5
C4.5 uses the information gain ratio as its purity measure: we split on the feature with the largest gain ratio.
As noted at the start, the information gain criterion is biased toward attributes with many distinct values. (Take the ID attribute mentioned above as an extreme case: it can take as many values as there are instances, so splitting on it produces the most branches; with fewer samples per child node, each child is more likely to be pure.)
Since this bias is undesirable, we divide the gain by a quantity intrinsic to the attribute, namely the attribute's own entropy, which grows as the attribute takes more values. Used as the denominator, it offsets the weakness of raw information gain.
Does that make the gain ratio flawless?
Of course not. With this denominator in place, the gain ratio criterion is in turn biased toward attributes with few distinct values: the smaller the denominator, the larger the ratio.
So C4.5 does not simply choose the candidate attribute with the largest gain ratio. It first selects the candidates whose information gain is above average (keeping most of the good features), and then, among those, picks the one with the highest gain ratio (which rules out extreme cases such as the ID attribute).
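This two-stage selection rule can be sketched in a few lines of Scala; the feature names and numbers below are made up for illustration:

```scala
object C45SelectDemo {
  // candidates: (feature name, information gain, gain ratio).
  // C4.5's heuristic: keep only features with at least average gain,
  // then pick the highest gain ratio among the survivors.
  def select(candidates: Seq[(String, Double, Double)]): String = {
    val avgGain = candidates.map(_._2).sum / candidates.size
    candidates.filter(_._2 >= avgGain).maxBy(_._3)._1
  }

  def main(args: Array[String]): Unit = {
    // "id" has the biggest gain but a poor ratio; "humidity" has the best
    // ratio but below-average gain; "outlook" wins the two-stage rule.
    println(select(Seq(("id", 2.0, 0.5), ("outlook", 1.9, 0.9), ("humidity", 0.1, 0.95)))) // prints outlook
  }
}
```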
Reference: 深入浅出理解决策树算法(二)-ID3算法与C4.5算法 (an introductory treatment of the ID3 and C4.5 algorithms)
CART algorithm
CART is short for Classification and Regression Trees; how it relates to ID3 and C4.5:
CART always builds a binary tree, whereas ID3 and C4.5 allow multiway splits, with one branch per attribute value.
For classification, CART measures a set's impurity with the Gini index:
Gini = \sum_{i=1}^{m} P_i (1 - P_i) = 1 - \sum_{i=1}^{m} P_i^2
GINI(D,A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)
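Both Gini formulas translate directly into code. A minimal standalone Scala sketch (names are my own):

```scala
object GiniDemo {
  // Gini impurity of a set of class labels: 1 - sum of squared class proportions.
  def gini(labels: Seq[String]): Double = {
    val n = labels.size.toDouble
    1.0 - labels.groupBy(identity).values.map(g => math.pow(g.size / n, 2)).sum
  }

  // Weighted Gini of a binary split (D1, D2): the quantity CART minimizes
  // when it scores a candidate split.
  def splitGini(d1: Seq[String], d2: Seq[String]): Double = {
    val n = (d1.size + d2.size).toDouble
    (d1.size / n) * gini(d1) + (d2.size / n) * gini(d2)
  }

  def main(args: Array[String]): Unit = {
    println(gini(Seq("a", "a", "b", "b")))           // 50/50 mix: 0.5
    println(splitGini(Seq("a", "a"), Seq("b", "b"))) // two pure children: 0.0
  }
}
```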
For regression, CART uses the mean squared error (MSE):
MSE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \overline{y})^2
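For a regression node this means predicting the node's mean target and scoring the node by the MSE around that mean. A small Scala sketch (illustrative names):

```scala
object MseDemo {
  // A CART regression leaf predicts the mean of its targets; its impurity
  // is the mean squared error of the targets around that mean.
  def nodeMse(ys: Seq[Double]): Double = {
    val mean = ys.sum / ys.size
    ys.map(y => math.pow(y - mean, 2)).sum / ys.size
  }

  def main(args: Array[String]): Unit = {
    println(nodeMse(Seq(3.0, 3.0, 3.0))) // identical targets: 0.0
    println(nodeMse(Seq(0.0, 2.0)))      // mean 1.0, MSE 1.0
  }
}
```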
Regression tree
Regression trees are the foundation of the GBDT and XGBoost algorithms.
Classification tree
GINI(D,A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)
Pruning for decision trees (ID3, C4.5, and CART regression/classification trees) has not been covered yet; it will be analyzed separately.
Modeling with Spark MLlib in practice
Iris example using the RDD-based API
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SparkMllibIris1 {
  def main(args: Array[String]): Unit = {
    // 1. Prepare the environment
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkMllibIris1Rdd")
    val sc = new SparkContext(conf)
    // 2. Read the data
    val path = "iris.csv"
    val rdd: RDD[String] = sc.textFile(path)
    // rdd.foreach(println)
    // 6.2,3.4,5.4,2.3,Iris-virginica
    // 5.9,3.0,5.1,1.8,Iris-virginica
    // 3. Feature engineering
    // 3-1 Build LabeledPoints. RDDs lack many of DataFrame's convenient APIs,
    //     so feature extraction, transformation, and selection are done by hand.
    val rddLp: RDD[LabeledPoint] = rdd.map(
      x => {
        val strings: Array[String] = x.split(",")
        regression.LabeledPoint(
          strings(4) match {
            case "Iris-setosa" => 0.0
            case "Iris-versicolor" => 1.0
            case "Iris-virginica" => 2.0
          },
          Vectors.dense(
            strings(0).toDouble,
            strings(1).toDouble,
            strings(2).toDouble,
            strings(3).toDouble))
      }
    )
    // rddLp.foreach(println)
    // (1.0,[6.0,2.9,4.5,1.5])
    // (0.0,[5.1,3.5,1.4,0.2])
    // 4. Split the data into training and test sets
    val Array(trainData, testData): Array[RDD[LabeledPoint]] = rddLp.randomSplit(Array(0.8, 0.2))
    // 5. Build the model: numClasses = 3, impurity = "gini", maxDepth = 8, maxBins = 16
    val decisionModel: DecisionTreeModel = DecisionTree.trainClassifier(trainData, 3, Map[Int, Int](), "gini", 8, 16)
    // 6. Pair each test-set prediction with its true label for the accuracy computation.
    //    DataFrames ship with evaluators for this; with RDDs we compute it ourselves.
    val result: RDD[(Double, Double)] = testData.map(
      x => {
        val pre: Double = decisionModel.predict(x.features)
        (x.label, pre)
      }
    )
    val acc: Double = result.filter(x => x._1 == x._2).count().toDouble / result.count()
    println(acc)
    println("error", (1 - acc))
    // 0.9642857142857143
    // (error,0.0357142857142857)
  }
}
Iris example using the DataFrame-based API
import org.apache.spark.ml.classification.{DecisionTreeClassificationModel, DecisionTreeClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, StringIndexerModel, VectorAssembler}
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object SparkMlIris2 {
  def main(args: Array[String]): Unit = {
    // 1. Prepare the environment
    val sparkSession: SparkSession = SparkSession.builder().master("local[*]").appName("SparkMllibIris2").getOrCreate()
    // 2. Prepare the data
    // 2-1 Read the CSV; the reader options are documented at
    //     http://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
    val path = "irisHeader.csv"
    // Note: add .option("inferSchema", "true"), otherwise every column in the schema is a String
    val df: DataFrame = sparkSession.read.format("csv").option("inferSchema", "true").option("header", "true").option("sep", ",").load(path)
    // df.printSchema()
    // root
    //  |-- sepal_length: double (nullable = true)
    //  |-- sepal_width: double (nullable = true)
    //  |-- petal_length: double (nullable = true)
    //  |-- petal_width: double (nullable = true)
    //  |-- class: string (nullable = true)
    // df.show(false)
    // +------------+-----------+------------+-----------+-----------+
    // |sepal_length|sepal_width|petal_length|petal_width|class      |
    // +------------+-----------+------------+-----------+-----------+
    // |5.1         |3.5        |1.4         |0.2        |Iris-setosa|
    // 3. Feature engineering
    // 3-1 Assemble the four feature columns into a single feature vector
    val assembler: VectorAssembler = new VectorAssembler().setInputCols(Array("sepal_length", "sepal_width", "petal_length", "petal_width")).setOutputCol("features")
    val assemblerDf: DataFrame = assembler.transform(df)
    assemblerDf.show(false)
    // 3-2 Convert the categorical class column into a numeric label column
    val stringIndex: StringIndexer = new StringIndexer().setInputCol("class").setOutputCol("label")
    val stringIndexModel: StringIndexerModel = stringIndex.fit(assemblerDf)
    val indexDf: DataFrame = stringIndexModel.transform(assemblerDf)
    // indexDf.show(false)
    // +------------+-----------+------------+-----------+-----------+-----------------+-----+
    // |sepal_length|sepal_width|petal_length|petal_width|class      |features         |label|
    // +------------+-----------+------------+-----------+-----------+-----------------+-----+
    // |5.1         |3.5        |1.4        |0.2        |Iris-setosa|[5.1,3.5,1.4,0.2]|0.0  |
    // |4.9         |3.0        |1.4        |0.2        |Iris-setosa|[4.9,3.0,1.4,0.2]|0.0  |
    // 3-3 Split the data into training and test sets
    val Array(trainData, testData): Array[Dataset[Row]] = indexDf.randomSplit(Array(0.8, 0.2))
    // 4. Configure the algorithm; labelCol defaults to "label"
    val classifier: DecisionTreeClassifier = new DecisionTreeClassifier().setFeaturesCol("features").setMaxBins(16).setImpurity("gini").setSeed(10)
    val dtcModel: DecisionTreeClassificationModel = classifier.fit(trainData)
    // 5. Predictions on the training set
    val trainPre: DataFrame = dtcModel.transform(trainData)
    // 6. Predictions on the test set
    val testPre: DataFrame = dtcModel.transform(testData)
    // 7. Evaluate, or save the model
    // val savePath = "E:\\ml\\workspace\\SparkMllibBase\\sparkmllib_part2\\DescitionTree\\model"
    // dtcModel.save(savePath)
    // trainPre.show(false)
    // |sepal_length|sepal_width|petal_length|petal_width|class      |features         |label|rawPrediction |probability  |prediction|
    // |4.3         |3.0        |1.1        |0.1        |Iris-setosa|[4.3,3.0,1.1,0.1]|0.0  |[47.0,0.0,0.0]|[1.0,0.0,0.0]|0.0       |
    // |4.4         |2.9        |1.4        |0.2        |Iris-setosa|[4.4,2.9,1.4,0.2]|0.0  |[47.0,0.0,0.0]|[1.0,0.0,0.0]|0.0       |
    // testPre.show(false)
    // |sepal_length|sepal_width|petal_length|petal_width|class          |features         |label|rawPrediction |probability  |prediction|
    // |4.6         |3.2        |1.4        |0.2        |Iris-setosa    |[4.6,3.2,1.4,0.2]|0.0  |[47.0,0.0,0.0]|[1.0,0.0,0.0]|0.0       |
    // |4.8         |3.4        |1.9        |0.2        |Iris-setosa    |[4.8,3.4,1.9,0.2]|0.0  |[47.0,0.0,0.0]|[1.0,0.0,0.0]|0.0       |
    // |5.0         |2.0        |3.5        |1.0        |Iris-versicolor|[5.0,2.0,3.5,1.0]|1.0  |[0.0,33.0,0.0]|[0.0,1.0,0.0]|1.0       |
    val acc: Double = new MulticlassClassificationEvaluator().setMetricName("accuracy").evaluate(testPre)
    println("acc is ", acc)
    println("err is", (1 - acc))
    // 8. Map the predicted label indices on the test set back to their string class names
    val indexToString: IndexToString = new IndexToString().setInputCol("prediction").setOutputCol("preStringLabel").setLabels(stringIndexModel.labels)
    val result: DataFrame = indexToString.transform(testPre)
    // result.show(false)
    // |sepal_length|sepal_width|petal_length|petal_width|class      |features         |label|rawPrediction |probability  |prediction|preStringLabel|
    // |4.6         |3.6        |1.0        |0.2        |Iris-setosa|[4.6,3.6,1.0,0.2]|0.0  |[38.0,0.0,0.0]|[1.0,0.0,0.0]|0.0       |Iris-setosa   |
    // |4.8         |3.4        |1.6        |0.2        |Iris-setosa|[4.8,3.4,1.6,0.2]|0.0  |[38.0,0.0,0.0]|[1.0,0.0,0.0]|0.0       |Iris-setosa   |
  }
}