Competition Description. The problem is described as follows:
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
First, download three CSV files from the Kaggle website: the sample submission submission.csv, the training file train.csv, and the test file test.csv.
Let's start by looking at the training data: read train.csv with read_csv(), then inspect it with the head() method, which lists the first five rows of the table.
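A minimal sketch of this step. To keep the example self-contained, it reads a tiny made-up CSV from an in-memory string rather than the real train.csv; with the downloaded file you would pass its path to read_csv() instead (the path "train.csv" is an assumption about where the file was saved).

```python
import io
import pandas as pd

# A tiny made-up stand-in for train.csv (NOT real Titanic data), kept in
# memory so the example runs without the downloaded file.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
    "3,1,3,female,\n"
)

train = pd.read_csv(sample_csv)  # with the real file: pd.read_csv("train.csv")
preview = train.head()           # head() returns the first five rows by default
print(preview)
print(train.shape)               # (number of rows, number of columns)
```

Checking train.shape and train.isna().sum() right after loading is a quick way to confirm the column count and spot the columns that will need filling later.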
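The training set has 12 columns in total, or 11 once PassengerId is excluded: 10 are independent variables, and Survived is our target variable, i.e. the dependent variable.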
After loading the training set and the test set into separate DataFrames, save the target variable.
Because we only want to keep the independent variables (the features) in the dataset, Survived is then dropped from the DataFrame.
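Saving the target and then dropping it might look like the sketch below. The rows are made up for illustration, and the variable names target and train are assumptions, not necessarily the ones used in the original code.

```python
import pandas as pd

# Made-up rows standing in for the loaded training set (not real data).
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

target = train["Survived"]                # save the target variable first
train = train.drop(columns=["Survived"])  # then keep only the features
print(train.columns.tolist())
```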
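A new temporary column ('training_set') is added to both the training set and the test set so that the two datasets can be concatenated into a single DataFrame. Next, the missing values are filled in, and the categorical features are converted to numeric features with one-hot encoding.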
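The combine, fill, and encode steps can be sketched as follows. The text does not say which fill strategy was used, so the median (numeric column) and the most frequent value (categorical column) here are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up feature rows (not real data); None marks a missing value.
train = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],
})
test = pd.DataFrame({
    "Age": [None, 35.0],
    "Sex": ["male", "female"],
    "Embarked": ["Q", "S"],
})

# Temporary flag so the two sets can be separated again later.
train["training_set"] = True
test["training_set"] = False
full = pd.concat([train, test], ignore_index=True)

# Assumed fill strategy: median for numbers, most frequent value for categories.
full["Age"] = full["Age"].fillna(full["Age"].median())
full["Embarked"] = full["Embarked"].fillna(full["Embarked"].mode()[0])

# One-hot encode the categorical columns into indicator columns.
full = pd.get_dummies(full, columns=["Sex", "Embarked"])
print(full.columns.tolist())
```

Encoding the combined DataFrame, rather than each set separately, guarantees that both sets end up with the same indicator columns even if a category appears in only one of them.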
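Next, split the two datasets apart again, drop the temporary column, and build a random forest with 100 trees (more trees generally give better results, but they also increase training time) that uses all of the machine's CPU cores (n_jobs=-1). Fit the forest on the training set, then use the fitted forest to predict the target variable for the test set.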
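A minimal sketch of the split, fit, and predict steps on made-up, already-encoded data. It assumes scikit-learn's RandomForestRegressor, which would explain predictions that come out as decimals; that choice is a guess, and a RandomForestClassifier would return 0/1 labels directly. random_state is added only to make the sketch reproducible.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up combined, already-encoded data (not real Titanic rows).
full = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Sex_male": [1, 0, 0, 0, 1, 1],
    "training_set": [True, True, True, True, False, False],
})
target = pd.Series([0, 1, 1, 1])  # made-up labels for the four training rows

# Split on the temporary flag, then drop it.
train = full[full["training_set"]].drop(columns=["training_set"])
test = full[~full["training_set"]].drop(columns=["training_set"])

# 100 trees, all CPU cores; the regressor is an assumption (see above).
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(train, target)
preds = rf.predict(test)  # one value between 0 and 1 per test row
print(preds)
```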
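The predictions here are decimals, whereas the competition documentation specifies 1 for survived and 0 for deceased, so I used astype to convert the results to int. Note that astype(int) truncates toward zero rather than rounding, so the values need to be rounded first if the nearest integer is wanted.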
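The difference between the possible conversions is easy to check on a few made-up prediction values. Note that astype(int) on its own truncates toward zero, so it only yields the nearest integer if the values are rounded first.

```python
import numpy as np

preds = np.array([0.2, 0.5, 0.7, 0.99])  # made-up decimal predictions

truncated = preds.astype(int)             # drops the fractional part
rounded = np.rint(preds).astype(int)      # rounds first (exact halves go to the even integer)
thresholded = (preds >= 0.5).astype(int)  # explicit 0.5 cut-off

print(truncated)    # [0 0 0 0]
print(rounded)      # [0 0 1 1]
print(thresholded)  # [0 1 1 1]
```

With plain truncation every prediction below 1.0 becomes 0, so an explicit threshold or a rounding step is usually what is intended.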
Put the results and their corresponding Ids into a DataFrame and save it to a CSV file.
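Building and saving the submission table can be sketched as below. The Ids and predictions are hypothetical, and the sketch writes to an in-memory buffer so it runs anywhere; to produce the actual file, pass a path such as "submission.csv" (an assumed name) to to_csv() instead.

```python
import io
import pandas as pd

test_ids = [892, 893, 894]  # hypothetical PassengerId values
preds = [0, 1, 0]           # hypothetical 0/1 predictions

submission = pd.DataFrame({"PassengerId": test_ids, "Survived": preds})

# index=False keeps the DataFrame's row index out of the file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # real run: submission.to_csv("submission.csv", index=False)
csv_text = buffer.getvalue()
print(csv_text)
```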
Finally, log in to Kaggle and submit the CSV file on the competition page.