Machine Learning: The Decision Tree Algorithm (1)

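A decision tree is a supervised machine learning algorithm whose structure resembles a flowchart laid out as a tree. It consists of nodes and directed edges, and each node is either an internal node or a leaf node. An internal node represents a feature (attribute), while a leaf node represents a class. Each path from the root to a leaf forms one rule, and the leaf at the end of the path is that rule's conclusion.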
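To make the "one path, one rule" idea concrete, here is a minimal sketch that walks a small hand-written tree and prints one rule per root-to-leaf path. The tree, its feature names (`Sex`, `Age<10`) and its labels are invented for illustration; they are not learned from the Titanic data, and `extract_rules` is a hypothetical helper, not a library function.

```python
# A hand-written toy tree (hypothetical, for illustration only).
# Internal nodes test a feature; leaf nodes hold a class label.
toy_tree = {
    "feature": "Sex",
    "branches": {
        "female": {"label": "survived"},
        "male": {
            "feature": "Age<10",
            "branches": {
                "yes": {"label": "survived"},
                "no": {"label": "died"},
            },
        },
    },
}


def extract_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if "label" in node:  # leaf node: the path so far is a complete rule
        return [(list(conditions), node["label"])]
    rules = []
    for value, child in node["branches"].items():
        step = (node["feature"], value)
        rules.extend(extract_rules(child, conditions + (step,)))
    return rules


rules = extract_rules(toy_tree)
for conds, label in rules:
    text = " AND ".join(f"{feature} = {value}" for feature, value in conds)
    print(f"IF {text} THEN {label}")
```

The toy tree has three leaves, so the loop prints three rules, one for each path.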

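Information entropy: a measure of how mixed a set of labels is, H(Y) = -Σ p_k log2(p_k), where p_k is the fraction of samples in class k. It can be computed over all samples, or over the subset of samples that share a given feature value.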

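Information gain: the entropy of the labels minus the conditional entropy of the labels once the samples are split on a feature. The larger the gain, the better the feature. Below, we use data from a Kaggle competition to predict who survived the Titanic.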
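Before turning to the data, the two definitions above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code scikit-learn uses internally; the functions `entropy` and `information_gain` and the toy lists `survived`, `sex` and `deck` are made up for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Label entropy minus the weighted label entropy after splitting on a feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(
        len(group) / total * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional


# Made-up toy data: 1 = survived, 0 = did not survive.
survived = [1, 1, 1, 0, 0, 0, 1, 0]
sex = ["f", "f", "f", "m", "m", "m", "f", "m"]   # separates the labels perfectly
deck = ["a", "b", "a", "b", "a", "b", "b", "a"]  # unrelated to the labels

gain_sex = information_gain(sex, survived)
gain_deck = information_gain(deck, survived)
print(gain_sex, gain_deck)
```

In this toy data the labels are split evenly, so their entropy is 1 bit. Splitting on `sex` removes all of that uncertainty (gain of 1 bit), while splitting on `deck` removes none (gain of 0), so `sex` would be chosen as the better feature.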

Data link: https://download.csdn.net/download/qq_36581957/10814246

Step one is to preprocess the data. How you preprocess depends on your own understanding of the data.

import pandas as pd


def DataAnalyse():
    data = pd.read_csv("./titanic/train.csv")
    # Drop columns that are not used as features here: passenger ID, cabin,
    # ticket number, name, and port of embarkation. Five of the 12 columns
    # are removed, leaving 7; one of those (Survived) is the label.
    data = data.drop(columns=["PassengerId", "Cabin", "Ticket", "Name", "Embarked"])
    # Encode sex as an integer: male -> 1, female -> 0.
    data["Sex"] = (data["Sex"] == "male").astype("int")
    # Encoding for the port of embarkation (unused, since Embarked is dropped above).
    # labels = data["Embarked"].unique().tolist()
    # data["Embarked"] = data["Embarked"].apply(lambda n: labels.index(n))
    # Some entries have no value; fill all of them with 0.
    data = data.fillna(0)
    data.info()  # the raw file has 12 columns and, excluding the header, 891 samples
    Y_train = data["Survived"]
    X_train = data.drop(columns=["Survived"])
    return X_train, Y_train


if __name__ == '__main__':
    X_train, Y_train = DataAnalyse()
    X_train.info()

The output looks roughly like this:


RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null int32
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
dtypes: float64(2), int32(1), int64(4)
memory usage: 45.3 KB

RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
Pclass    891 non-null int64
Sex       891 non-null int32
Age       891 non-null float64
SibSp     891 non-null int64
Parch     891 non-null int64
Fare      891 non-null float64
dtypes: float64(2), int32(1), int64(3)
memory usage: 38.4 KB
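Since preprocessing choices are a matter of judgement, it can help to compare them on a tiny frame before committing. The sketch below applies the same sex encoding and the same fill-with-0 step as `DataAnalyse()`, then shows median imputation for `Age` as one possible alternative. The `toy` frame is made-up data, and the median variant is an assumption for comparison, not something this post uses.

```python
import pandas as pd

# Tiny made-up sample mimicking three of the Titanic columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 38.0, None],
    "Fare": [7.25, 71.28, 8.05, 13.0],
})

# Same encoding as in DataAnalyse(): male -> 1, female -> 0.
toy["Sex"] = (toy["Sex"] == "male").astype("int")

# Option used in this post: replace every missing value with 0.
filled_zero = toy.fillna(0)

# Alternative (an assumption, not what the post does): fill Age with its median.
filled_median = toy.fillna({"Age": toy["Age"].median()})

print(filled_zero["Age"].tolist())
print(filled_median["Age"].tolist())
```

Filling with 0 treats passengers of unknown age as newborns, which a tree may pick up as a spurious split; the median keeps them in the middle of the age range instead.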

With the training set processed, we can build the decision tree model.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def datasplit(X, Y):
    # Hold out 20% of the samples as a test set.
    # No random_state is set, so the split (and the scores) vary between runs.
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
    return x_train, x_test, y_train, y_test


def DecisionTree(x_train, x_test, y_train, y_test):
    # Fit a tree with default settings and report accuracy on both sets.
    clf = DecisionTreeClassifier()
    clf.fit(x_train, y_train)
    train_score = clf.score(x_train, y_train)
    test_score = clf.score(x_test, y_test)
    return train_score, test_score


if __name__ == '__main__':
    # DataAnalyse() is the preprocessing function defined above.
    X_train, Y_train = DataAnalyse()
    x_train, x_test, y_train, y_test = datasplit(X_train, Y_train)
    train_score, test_score = DecisionTree(x_train, x_test, y_train, y_test)
    print(train_score, test_score)

Output:

0.9859550561797753 0.8044692737430168

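The results show a large gap between the training score (about 0.99) and the test score (about 0.80), which is a sign of overfitting. We will continue to optimize this example in upcoming posts.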
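One common way to narrow such a gap, though not necessarily the approach the follow-up posts take, is to limit how deep the tree may grow. The sketch below compares an unrestricted tree with one capped at `max_depth=3`. It uses synthetic data from `make_classification` rather than the Titanic set, and the depth of 3 and the fixed seeds are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic set), with a fixed seed for repeatability.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scores = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(x_train, y_train)
    scores[depth] = (clf.score(x_train, y_train), clf.score(x_test, y_test))
    print(depth, scores[depth])
```

The unrestricted tree memorizes the training set, so its training score is perfect. The capped tree cannot memorize the data, so its training score can only be the same or lower; whether its test score improves depends on the data and should be checked rather than assumed.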

 
