The Titanic completed her sea trials on April 2, 1912. She was the largest and most luxuriously appointed passenger liner of her day and was popularly said to be "unsinkable". On her maiden voyage, however, she struck an iceberg, and more than 1,500 of the 2,224 passengers and crew on board lost their lives.
After reading the file with pandas, we can see that the data looks like this:
This is Kaggle's description of the fields:
Calling data_train.info() gives some more detailed information:
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
Clearly some values are missing, and fields like Name are of no direct use to us, so we need to repair and filter the data to extract the genuinely useful information.
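A quick way to see which columns have gaps, and one simple way to fill them, is sketched below. The choices here (median Age, most frequent Embarked, and a new HasCabin flag) are only my own illustration of common options, not necessarily what the final model uses:

import pandas as pd

# data_train is the training set from Kaggle (the same file read in the plotting script below)
data_train = pd.read_csv("E:/my_file/for_pycharm/titanic/Train.csv")

# Count the missing values per column: Age, Cabin and Embarked are the incomplete ones.
print(data_train.isnull().sum())

# One simple imputation scheme, purely illustrative:
data_train['Age'] = data_train['Age'].fillna(data_train['Age'].median())                   # median age
data_train['Embarked'] = data_train['Embarked'].fillna(data_train['Embarked'].mode()[0])   # most frequent port
data_train['HasCabin'] = data_train['Cabin'].notnull().astype(int)                         # keep only "has a cabin record"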
Let's first plot the data to get an intuitive feel for its characteristics. The code is as follows:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor  # not used in this plotting snippet

data_train = pd.read_csv("E:/my_file/for_pycharm/titanic/Train.csv")

fig = plt.figure()
fig.set(alpha=0.2)  # figure transparency

# Survival counts
plt.subplot2grid((2, 3), (0, 0))
data_train.Survived.value_counts().plot(kind='bar')
plt.title(u"rescue situation (1 means survived)")
plt.ylabel(u"number")

# Passenger class counts
plt.subplot2grid((2, 3), (0, 1))
data_train.Pclass.value_counts().plot(kind="bar")
plt.ylabel(u"number")
plt.title(u"passenger class distribution")

# Age versus survival
plt.subplot2grid((2, 3), (0, 2))
plt.scatter(data_train.Survived, data_train.Age)
plt.ylabel(u"age")
plt.grid(True, which='major', axis='y')
plt.title(u"age distribution by survival (1 means survived)")

# Age density for each passenger class
plt.subplot2grid((2, 3), (1, 0), colspan=2)
data_train.Age[data_train.Pclass == 1].plot(kind='kde')
data_train.Age[data_train.Pclass == 2].plot(kind='kde')
data_train.Age[data_train.Pclass == 3].plot(kind='kde')
plt.xlabel(u"age")
plt.ylabel(u"density")
plt.title(u"age distribution by passenger class")
plt.legend((u'1st class', u'2nd class', u'3rd class'), loc='best')

# Counts of passengers boarding at each port
plt.subplot2grid((2, 3), (1, 2))
data_train.Embarked.value_counts().plot(kind='bar')
plt.title(u"number boarding at each port")
plt.ylabel(u"number")

plt.show()
Here is the resulting figure:
This gives us an intuitive feel for the data.
Now consider the 12 attributes in the data. PassengerId and Name are clearly irrelevant, so we drop them; Survived is the classification target, so it is set aside as well. That leaves: Pclass, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked.
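Dropping the clearly irrelevant columns is a one-liner in pandas; a minimal sketch (feature_df is my own name, and how Ticket, Fare and Cabin are eventually handled is a separate decision):

# Keep Survived as the label, drop the columns we will not use as features.
feature_df = data_train.drop(['PassengerId', 'Name'], axis=1)
print(feature_df.columns.tolist())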
Next, we tally SibSp and Parch against survival; the results are shown below.
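A tally like that can be produced by cross-tabulating each column against Survived; a minimal pandas sketch (the exact grouping I plotted may have been different):

# Survival counts by number of siblings/spouses aboard
print(pd.crosstab(data_train['SibSp'], data_train['Survived']))

# Survival counts by number of parents/children aboard
print(pd.crosstab(data_train['Parch'], data_train['Survived']))

# Survival rates per group are often easier to read than raw counts
print(data_train.groupby('SibSp')['Survived'].mean())
print(data_train.groupby('Parch')['Survived'].mean())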
Here I built a decision tree using the code from Machine Learning in Action (《机器学习实战》); the resulting tree diagram is shown below.
from math import log
import operator


def calcShannonEnt(dataSet):
    """Shannon entropy of the class labels (last column) in dataSet."""
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt


def splitDataSet(dataSet, axis, value):
    """Return the rows whose feature `axis` equals `value`, with that feature removed."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


def chooseBestFeatureToSplit(dataSet):
    """Pick the feature whose split yields the largest information gain."""
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0                     # reset for each candidate feature
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature


def majorityCnt(classList):
    """Most common class label, used when no features are left to split on."""
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def createTree(dataSet, labels):
    """Recursively build the decision tree as nested dicts."""
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]                  # all labels identical: leaf node
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)        # no features left: majority vote
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]                # copy so recursion does not clobber labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
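The functions above expect a plain Python list of rows whose last element is the class label, with each feature taking a small set of discrete values, rather than a DataFrame. The sketch below shows one plausible way to get the data into that shape and build the tree; the prepare_rows helper, the age bins and the chosen feature set are my own illustration, since the exact preprocessing isn't reproduced in this post:

def prepare_rows(df):
    # Convert the DataFrame into the list-of-lists format used above:
    # discrete feature values, with the class label ('Survived') in the last position.
    df = df.copy()
    # Coarse age bins; assumes missing ages were filled earlier. The bin edges are arbitrary.
    df['Age'] = pd.cut(df['Age'], bins=[0, 12, 30, 60, 100],
                       labels=['child', 'young', 'adult', 'old'])
    features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked']
    rows = df[features + ['Survived']].astype(str).values.tolist()
    return rows, features[:]

rows, labels = prepare_rows(data_train)
myTree = createTree(rows, labels[:])   # pass a copy: createTree deletes used feature names
print(myTree)                          # nested dicts describing the learned tree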
As you can see, the decision tree captures the patterns in the data well and presents them in a very simple form. Some of the rules it finds are interesting or thought-provoking; I'll leave them for you to explore.
After applying the same preprocessing to the test data as to the training data, I used the classify function to predict on the test set:
def classify(inputTree, featLabels, testVec):
    """Walk the nested-dict tree with one test vector and return its class label."""
    firstStr = list(inputTree.keys())[0]          # feature name at this node
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)        # position of that feature in the test vector
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)   # descend into the subtree
    else:
        classLabel = valueOfFeat                  # leaf node: this is the prediction
    return classLabel
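Finally, the predictions have to be written out in Kaggle's submission format (PassengerId, Survived). The sketch below assumes a Test.csv next to Train.csv and re-uses the illustrative preprocessing from above, so treat it as an outline rather than the exact script I ran:

data_test = pd.read_csv("E:/my_file/for_pycharm/titanic/Test.csv")   # assumed path

# ... fill missing values and bin Age exactly as was done for the training set ...

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked']
test_rows = data_test[features].astype(str).values.tolist()

# NOTE: classify raises a KeyError for feature values it never saw during training,
# so a fallback (e.g. the majority class) may be needed in practice.
predictions = [int(classify(myTree, labels, row)) for row in test_rows]

submission = pd.DataFrame({'PassengerId': data_test['PassengerId'],
                           'Survived': predictions})
submission.to_csv("titanic_submission.csv", index=False)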
Submitting the result to Kaggle:
I'm reasonably satisfied with this myself; after all, it is only a first draft, and quite a lot of information was discarded during the data processing. Thinking back on all the twists and turns of wrangling the files and the data, TAT, I did my best...
That's it for now.
worked by zzzzzr