Kaggle competition problem: Titanic: Machine Learning from Disaster

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Competition link: https://www.kaggle.com/c/titanic-gettingStarted

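Summary: The sinking of the Titanic killed 1502 passengers and crew. One major cause of the tragedy was that there were far too few lifeboats. Although luck played a part, some groups, such as women, children, and the upper class, were more likely to survive. In this problem, we are asked to analyze which kinds of people were more likely to survive.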

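From books, films, and other prior knowledge, we know that women and children were given priority. The training data lets us check this by computing the survival rate of women.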

#!/usr/bin/env python
# coding: utf-8
'''
Created on 2014-11-25
@author: zhaohf
'''
import pandas as pd

df = pd.read_csv('../Data/train.csv', header=0)
# Count all female passengers, then those among them who survived
female_tourist = len(df[df['Sex'] == 'female'])
female_survived = len(df[(df['Sex'] == 'female') & (df['Survived'] == 1)])
print(female_survived * 1.0 / female_tourist)

The result is 0.742038216561.

So, as a crude rule, we can assume that a passenger survived if she was a woman.
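Before submitting, a rule like this can be scored locally against labeled data. The following is only a sketch under stated assumptions: the `toy` frame is hypothetical hand-made data (not the real train.csv), and `gender_rule_accuracy` is an illustrative helper name, not something from the original post.

```python
import pandas as pd

# Hypothetical toy data, NOT the real train.csv; it only reuses the column names.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Survived": [1,        1,        0,        0,      0,      1,      0,      1],
})

def gender_rule_accuracy(frame):
    """Fraction of rows where 'female => survived, male => died' matches the label."""
    predicted = (frame["Sex"] == "female").astype(int)
    return float((predicted == frame["Survived"]).mean())

accuracy = gender_rule_accuracy(toy)
print(accuracy)
```

Passing the real training frame instead of `toy` would give the rule's accuracy on the training set, which is a rough preview of the leaderboard score.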

Now let's use this simple rule to predict the test data:

import csv

tf = pd.read_csv('../Data/test.csv', header=0)
# Keep only PassengerId (column 0) and Sex (column 3); copy to avoid SettingWithCopyWarning
ntf = tf.iloc[:, [0, 3]].copy()
# Predict 1 (survived) for every female and 0 for every male
ntf['Gender'] = ntf['Sex'].map({'female': 1, 'male': 0}).astype(int)
ids = ntf['PassengerId'].values
predicts = ntf['Gender'].values
# Python 3: open in text mode with newline='' for the csv module
with open("../Submissions/gender_submission.csv", "w", newline="") as predictions_file:
    writer = csv.writer(predictions_file)
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(zip(ids, predicts))

The score after submission was 0.76555.

So the accuracy is only so-so. My guess is that adding social class and age would make the predictions more accurate.
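One way to test that guess is to break the survival rate down by more than one column. The sketch below uses a hypothetical hand-made frame (the numbers are illustrative, not taken from train.csv) and groups by both `Sex` and `Pclass`.

```python
import pandas as pd

# Hypothetical toy data with the competition's column names; values are made up.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Pclass":   [1,        1,        3,        3,        1,      1,      3,      3],
    "Survived": [1,        1,        1,        0,        1,      0,      0,      0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate per group.
rates = toy.groupby(["Sex", "Pclass"])["Survived"].mean()
print(rates)
```

If the rates within one sex differ a lot between classes on the real data, class carries information that the gender-only rule ignores.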

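Next comes some data cleaning, because many values are non-numeric or missing. The output below shows the number of non-null values per column. The training data has 891 rows in total, and Age and Cabin have many missing values.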

import pandas as pd
df = pd.read_csv('../Data/train.csv', header=0)
df.info()  # info() prints the summary itself and returns None

Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)

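Non-numeric fields such as Name and Ticket are not used in the later steps, so we drop those columns and also handle the NaN values. Missing ages are simply replaced with the mean age.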

df = df.drop(['Ticket', 'Name', 'Cabin', 'Embarked'], axis=1)
# Mean of the known ages, truncated to an integer (Series.mean() skips NaN)
mean_age = int(df['Age'].mean())
df['Age'] = df['Age'].fillna(mean_age)
# Encode sex as a number: female = 1, male = 0
df['Sex'] = df['Sex'].map({'female': 1, 'male': 0}).astype(int)

The test file needs the same preprocessing before the model can be applied to it.
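One way to keep the training and test preprocessing in sync is to wrap the steps in a single function. The sketch below is an assumption about how that might look, not the author's code: `preprocess`, `DROP_COLUMNS`, and the toy frames are hypothetical. It also fills test ages with the training mean, whereas the full script in this post uses the test file's own mean.

```python
import numpy as np
import pandas as pd

DROP_COLUMNS = ["Ticket", "Name", "Cabin", "Embarked"]  # hypothetical constant

def preprocess(frame, fill_age):
    """Apply the cleaning steps described above to any Titanic-style frame."""
    out = frame.drop(columns=DROP_COLUMNS)      # returns a new frame
    out["Age"] = out["Age"].fillna(fill_age)    # missing age -> supplied value
    out["Sex"] = out["Sex"].map({"female": 1, "male": 0}).astype(int)
    out["Fare"] = out["Fare"].fillna(0)         # missing fare -> 0
    return out

# Hypothetical toy frames standing in for train.csv and test.csv.
train_toy = pd.DataFrame({
    "PassengerId": [1, 2, 3], "Survived": [0, 1, 1],
    "Name": ["A", "B", "C"], "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 38.0], "Ticket": ["t1", "t2", "t3"],
    "Fare": [7.25, 71.28, 7.92], "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", "S"],
})
test_toy = pd.DataFrame({
    "PassengerId": [4, 5],
    "Name": ["D", "E"], "Sex": ["male", "female"],
    "Age": [np.nan, 30.0], "Ticket": ["t4", "t5"],
    "Fare": [np.nan, 7.25], "Cabin": [np.nan, np.nan],
    "Embarked": ["Q", "S"],
})

fill_age = int(train_toy["Age"].mean())   # computed once, on the training data
clean_train = preprocess(train_toy, fill_age)
clean_test = preprocess(test_toy, fill_age)
print(clean_test)
```

Computing the fill value once and passing it in means both files are cleaned with exactly the same rule.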

Next, we train a model using a decision tree.

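import numpy as np
from sklearn import tree
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split

X = df.values
y = df['Survived'].values
# Remove column 1 (Survived) so the label is not used as a feature
X = np.delete(X, 1, axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
dt = tree.DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)
print(dt.score(X_test, y_test))

If the score looks reasonable, use the model to predict on the test file.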
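A single 70/30 split can give a noisy score, so one option for judging whether the score is "reasonable" is k-fold cross-validation. The sketch below is illustrative only: `X_demo` and `y_demo` are synthetic data generated here, not the Titanic features, and the depths tried are arbitrary.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 rows, 3 features; the label depends only on feature 0.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

# Mean 5-fold accuracy for a few tree depths.
mean_scores = {}
for depth in (1, 3, 5):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_scores[depth] = scores.mean()
    print(depth, mean_scores[depth])
```

With the real data, `X` and `y` from the snippet above would take the place of `X_demo` and `y_demo`.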

The complete code:

#!/usr/bin/env python
# coding: utf-8
'''
Created on 2014-11-25
@author: zhaohf
'''
import csv

import numpy as np
import pandas as pd
from sklearn import tree
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

# Training data: drop unused text columns, fill missing ages, encode sex as 0/1
df = pd.read_csv('../Data/train.csv', header=0)
df = df.drop(['Ticket', 'Name', 'Cabin', 'Embarked'], axis=1)
mean_age = int(df['Age'].mean())  # Series.mean() skips NaN
df['Age'] = df['Age'].fillna(mean_age)
df['Sex'] = df['Sex'].map({'female': 1, 'male': 0}).astype(int)

# Features are every remaining column except Survived (column 1)
X = df.values
y = df['Survived'].values
X = np.delete(X, 1, axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
dt = tree.DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)
print(dt.score(X_test, y_test))

# Test data: same preprocessing; any missing Fare is set to 0
test = pd.read_csv('../Data/test.csv', header=0)
tf = test.drop(['Ticket', 'Name', 'Cabin', 'Embarked'], axis=1)
test_mean_age = int(tf['Age'].mean())
tf['Age'] = tf['Age'].fillna(test_mean_age).astype(int)
tf['Sex'] = tf['Sex'].map({'female': 1, 'male': 0}).astype(int)
tf['Fare'] = tf['Fare'].fillna(0).astype(int)
predicts = dt.predict(tf.values)
ids = tf['PassengerId'].values

# Python 3: open in text mode with newline='' for the csv module
with open("../Submissions/dt_submission.csv", "w", newline="") as predictions_file:
    writer = csv.writer(predictions_file)
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(zip(ids, predicts))

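Below are the feature importances of the trained decision tree (available as dt.feature_importances_). Each value is the normalized total reduction in impurity contributed by that feature; with scikit-learn's default settings the criterion is Gini impurity rather than information entropy. The third value corresponds to sex and accounts for more than half of the total, which shows how much sex mattered for survival in this disaster.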

[ 0.06664883  0.14876052  0.52117953  0.10608185  0.08553209  0.00525581 0.06654137]
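To make the vector above easier to read, the values can be paired with the column names. This sketch assumes the feature order is the column order of train.csv after dropping Ticket, Name, Cabin, and Embarked and removing Survived. The numbers are copied from the output above, and `ranked` is just an illustrative name.

```python
# Assumed feature order: remaining train.csv columns without Survived.
feature_names = ["PassengerId", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
# Values copied from the printed importance vector.
importances = [0.06664883, 0.14876052, 0.52117953, 0.10608185,
               0.08553209, 0.00525581, 0.06654137]

# Sort features from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print("%-12s %.4f" % (name, value))
```

Under that assumption, PassengerId receives a small but nonzero importance even though it is only a row identifier, which hints that it may be worth dropping from the features.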
The final score was 0.78469. That is still a fairly low score; some people manage to get 100% correct!
