A Machine Learning Report on Kaggle's Titanic: Machine Learning from Disaster (First Draft)


1 Problem Introduction

    The RMS Titanic completed her sea trials on April 2, 1912. She was the largest and most luxuriously appointed passenger liner in the world at the time and was reputed to be "unsinkable". On her maiden voyage, however, the Titanic struck an iceberg, and more than 1,500 of the 2,224 passengers and crew aboard lost their lives.


2 Getting to Know the Data

    After reading the file with pandas, we can take a first look at the data:
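    A minimal sketch of this first look (the path is the one used later in this report; head() simply prints the first few rows):

import pandas as pd

data_train = pd.read_csv("E:/my_file/for_pycharm/titanic/Train.csv")
print(data_train.head())  # first five rows of the training set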

    Here is Kaggle's description of the data:

    Calling data_train.info() gives us some more detailed information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

    Clearly, some values are missing, and fields such as Name are of no help to us. We therefore need to repair and filter the data to extract the genuinely useful information.


3 Analyzing and Processing the Data

    We first plot the data to get an intuitive feel for its characteristics. The code is as follows:

import pandas as pd     
import numpy as np      
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

data_train = pd.read_csv("E:/my_file/for_pycharm/titanic/Train.csv")

fig = plt.figure()
fig.set(alpha=0.2)  # set figure transparency

plt.subplot2grid((2, 3), (0, 0))
data_train.Survived.value_counts().plot(kind='bar')  # bar chart of survived vs. not survived
plt.title(u"survival counts (1 means survived)")
plt.ylabel(u"number")

plt.subplot2grid((2, 3), (0, 1))
data_train.Pclass.value_counts().plot(kind="bar")  # bar chart of passenger class counts
plt.ylabel(u"number")
plt.title(u"passenger class distribution")

plt.subplot2grid((2, 3), (0, 2))
plt.scatter(data_train.Survived, data_train.Age)  # age plotted against survival
plt.ylabel(u"age")
plt.grid(True, which='major', axis='y')
plt.title(u"survival by age (1 means survived)")

plt.subplot2grid((2, 3), (1, 0), colspan=2)
data_train.Age[data_train.Pclass == 1].plot(kind='kde')  # age density for each passenger class
data_train.Age[data_train.Pclass == 2].plot(kind='kde')
data_train.Age[data_train.Pclass == 3].plot(kind='kde')
plt.xlabel(u"age")  # plots an axis label
plt.ylabel(u"density")
plt.title(u"age distribution by passenger class")
plt.legend((u'1st class', u'2nd class', u'3rd class'), loc='best')

plt.subplot2grid((2, 3), (1, 2))
data_train.Embarked.value_counts().plot(kind='bar')  # bar chart of embarkation port counts
plt.title(u"passengers boarded at each port")
plt.ylabel(u"number")
plt.show()

  Here is the resulting figure:

  Now we have some intuitive feel for the data.
  Next, consider the 12 attributes in the data. PassengerId and Name are clearly irrelevant, so we set them aside, and Survived is our classification target, so it is excluded as well. The remaining attributes are: Pclass, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked.
  Here we compute some statistics on SibSp and Parch; the results are as follows:
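  A minimal sketch of how such statistics could be computed (grouping survival by whether SibSp or Parch is nonzero; the exact tables from the original run are not reproduced here):

# survival rate for passengers with vs. without siblings/spouses aboard
print(data_train.groupby(data_train.SibSp > 0).Survived.mean())

# survival rate for passengers with vs. without parents/children aboard
print(data_train.groupby(data_train.Parch > 0).Survived.mean())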



  Judging from the statistics, whether a passenger had siblings/spouses aboard or parents/children aboard seems to have little effect on survival; in both cases the rate is around 50%, so we discard these two attributes.
  The Fare attribute has a very wide range and is hard to discretize, so I discard it as well.
  As for Ticket, the ticket numbers vary too widely and are hard to summarize, so we do not consider this attribute.
  As for Cabin, only 204 of the 891 records have a value. Rather than trying to repair the missing entries, I split this attribute into two classes: records with a Cabin value are labeled 1, and records without one are labeled 0.
  As for Age, considering that the young and the elderly were given priority (as was also the case in the real Titanic disaster), I assign passengers younger than 16 or older than 60 to class 1 and everyone else to class 2.
  This leaves us with just five attributes: Pclass, Sex, Age, Cabin, Embarked, all of which are discrete. The next step is to train my classifier on these five attributes.
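  A minimal sketch of the preprocessing described above (how missing Age and Embarked values are handled here is my own assumption; the code actually used for this draft may differ in detail):

def preprocess(df):
    df = df.copy()
    # Cabin: 1 if a cabin value is recorded, 0 otherwise
    df['Cabin'] = df['Cabin'].notnull().astype(int)
    # Age: class 1 for the young (<16) and the elderly (>60), class 2 for everyone else
    # (missing ages fall into class 2 here)
    df['Age'] = df['Age'].apply(lambda a: 1 if (a < 16 or a > 60) else 2)
    # Embarked: fill the two missing values with the most common port
    df['Embarked'] = df['Embarked'].fillna('S')
    # keep only the five discrete attributes, plus the label
    return df[['Pclass', 'Sex', 'Age', 'Cabin', 'Embarked', 'Survived']]

processed_train = preprocess(data_train)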


4 Building the Decision Tree

  Here I built a decision tree using the code from Machine Learning in Action; the resulting tree is shown below:


Here is the code:

from math import log
import operator

def calcShannonEnt(dataSet):
    # compute the Shannon entropy of the class labels in dataSet
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet: 
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2) 
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    # return the rows whose feature at index axis equals value, with that feature removed
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    # pick the feature whose split yields the highest information gain
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if (infoGain > bestInfoGain):       
            bestInfoGain = infoGain         
            bestFeature = i
    return bestFeature                      

def majorityCnt(classList):
    # return the most frequent class label (used when all features are exhausted)
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    # recursively build the decision tree as nested dictionaries
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList): 
        return classList[0]
    if len(dataSet[0]) == 1: 
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]      
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree           

  As you can see, the decision tree summarizes the patterns in the data well and presents them in a simple form. Some of these patterns are interesting or thought-provoking; I leave them for the reader to explore.
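  A minimal sketch of how the tree could be built from the processed training data (the conversion from DataFrame rows to plain lists is my own illustration, not necessarily the exact code behind this draft):

labels = ['Pclass', 'Sex', 'Age', 'Cabin', 'Embarked']
# each row becomes [Pclass, Sex, Age, Cabin, Embarked, Survived]; the class label must come last
dataSet = processed_train[labels + ['Survived']].values.tolist()
myTree = createTree(dataSet, labels[:])  # pass a copy, since createTree deletes entries from its labels argument
print(myTree)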


5 Evaluating the Results

  After applying the same processing to the test data as to the training data, I used the classify function to make predictions on the test data:

def classify(inputTree,featLabels,testVec):
    # walk the nested-dictionary tree until a leaf (class label) is reached
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]  # note: raises KeyError if this value never appeared at this node during training
    if isinstance(valueOfFeat, dict): 
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else: classLabel = valueOfFeat
    return classLabel
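  A minimal sketch of producing the submission file (the test-set path and the fallback for feature values the tree never saw are my own assumptions; labels and myTree come from the sketch in section 4):

data_test = pd.read_csv("E:/my_file/for_pycharm/titanic/Test.csv")  # assumed path for the test set
# same preprocessing as for the training data, minus the Survived column
data_test['Cabin'] = data_test['Cabin'].notnull().astype(int)
data_test['Age'] = data_test['Age'].apply(lambda a: 1 if (a < 16 or a > 60) else 2)
data_test['Embarked'] = data_test['Embarked'].fillna('S')

predictions = []
for row in data_test[labels].values.tolist():
    try:
        predictions.append(int(classify(myTree, labels, row)))
    except KeyError:
        predictions.append(0)  # fallback for feature values never seen during training

submission = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
submission.to_csv("submission.csv", index=False)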

  Submitting the results to Kaggle:



6 Conclusion

  I am personally fairly satisfied with the result. After all, this is only a first draft, and a lot of information was discarded during data processing. Looking back on the bumps and struggles of wrangling files and data, TAT, I did my best...
  That's all.


   worked by zzzzzr
