Titanic: Machine Learning from Disaster (a Kaggle data mining competition)

Predict survival on the Titanic (with tutorials in Excel, Python, R, and an introduction to Random Forests)

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Competition link: https://www.kaggle.com/c/titanic-gettingStarted

After finishing Intro to Machine Learning on Udacity, I came to Kaggle to practice. This was my first Kaggle competition, and I learned a great deal! At the start I had no idea where to begin, so I studied the code in the tutorials and got a rough sense of the overall framework for this kind of problem. Then I implemented my own code from scratch. After producing my first result, the accuracy was low, and I didn't know where to start improving it. So I read the forum discussions and other people's write-up blogs, and frequently consulted the sklearn and pandas documentation.

My feeling is that in a data mining competition, if everyone uses the same algorithm libraries, the final result comes down to three factors (since no additional data can be obtained): 1. how the data is preprocessed; 2. feature engineering; 3. tuning the algorithm's parameters. Preprocessing covers handling missing data and converting data types. Feature engineering is the most complex part: some people can mine new features out of the existing ones, some features turn out to be redundant, and there are many techniques involved. Parameter tuning can be automated with a tool like GridSearchCV (see the sketch after this paragraph), but it is still tedious. Since this was my first competition, my experience is limited and some of these impressions may be wrong; corrections are welcome, and I'd be glad to discuss the competition with anyone interested! Code that completes the basic task follows.
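For illustration, here is a minimal GridSearchCV sketch for tuning the random forest used below. It is not part of my submission; the grid values are assumptions picked for illustration, and X, Y refer to the training arrays built in the script below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# illustrative parameter grid -- these values are assumptions, not tuned results
param_grid = {'n_estimators': [100, 500, 1000], 'min_samples_leaf': [1, 3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=312), param_grid, cv=5)
#search.fit(X, Y)               # X, Y come from the script below
#print(search.best_params_)
#print(search.best_score_)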

import pandas as pd
import csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV
 
# read in the training and testing data into Pandas.DataFrame objects
input_df = pd.read_csv('train.csv', header=0)
submit_df  = pd.read_csv('test.csv',  header=0)

# merge the two DataFrames into one
df = pd.concat([input_df, submit_df])
df = df.reset_index(drop=True)
df = df.reindex(columns=input_df.columns)

#df.info()
#print(df.describe())

""" cope with the missing data """

# 1. 'Cabin': seems useless, too many values are missing; two ways to handle it
# (1) drop this feature
df = df.drop('Cabin', axis=1)
# (2) assign a value that represents "missing"
#df.loc[df['Cabin'].isnull(), 'Cabin'] = 'missing'

# 2. 'Age': an important feature, we can't just drop it; two ways to handle it
# (1) fill it with the median
df.loc[df['Age'].isnull(), 'Age'] = df['Age'].median()
# (2) use a simple model to predict it, after everything else is fixed. In this
# case the resulting accuracy was too low, so the method was abandoned.
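# A minimal sketch of option (2), kept commented out; run it instead of (1),
# and note the choice of predictor columns is an illustrative assumption:
#from sklearn.ensemble import RandomForestRegressor
#age_cols = ['Pclass', 'SibSp', 'Parch']
#known_age = df[df['Age'].notnull()]
#age_model = RandomForestRegressor(n_estimators=100, random_state=312)
#age_model.fit(known_age[age_cols], known_age['Age'])
#df.loc[df['Age'].isnull(), 'Age'] = age_model.predict(df.loc[df['Age'].isnull(), age_cols])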

# 3. 'Fare': only one value is missing, just use the median
df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].median()

# 4. 'Embarked': two values are missing, use the most frequent one
df.loc[df['Embarked'].isnull(), 'Embarked'] = df['Embarked'].mode()[0]


""" feature engineering """

# 1. 'Sex': factorize it (encode the categories as integers)
df['Sex'] = pd.factorize(df['Sex'])[0]

# 2. 'Name': we could extract a title from the name; for simplicity, I just drop it
df = df.drop('Name', axis=1)
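# A minimal sketch of the title extraction, kept commented out since the
# feature is not used here; it would have to run before 'Name' is dropped
# above, and the regex is an illustrative assumption:
#df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
#df['Title'] = pd.factorize(df['Title'])[0]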

# 3. 'Ticket': for simplicity, just drop it
df = df.drop('Ticket', axis=1)

# 4. 'Embarked': factorize it
df['Embarked'] = pd.factorize(df['Embarked'])[0]

# 5. 'SibSp', 'Parch': combine them to get a new feature, 'Family_size'
df['Family_size'] = df['SibSp'] + df['Parch']


""" train the model """

df = df.drop(['PassengerId', 'SibSp', 'Parch'], axis=1)
known_survived = df[df['Survived'].notnull()].values
unknown_survived = df[df['Survived'].isnull()].values

# after the drops above, 'Survived' is the first column
Y = known_survived[:, 0]
X = known_survived[:, 1:]

clf = RandomForestClassifier(n_estimators=1000, random_state=312, min_samples_leaf=3).fit(X, Y)
#print(cross_val_score(clf, X, Y).mean())
#print(df.columns)
#print(clf.feature_importances_)
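# To pair each feature with its importance (column order as built above,
# with 'Survived' first):
#for name, imp in zip(df.columns[1:], clf.feature_importances_):
#    print(name, imp)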


""" predict and write to csv file """

pred = clf.predict(unknown_survived[:, 1:]).astype(int)
ids = submit_df['PassengerId']

predictions_file = open("myfirstsubmission.csv", "w", newline='')
open_file_object = csv.writer(predictions_file)
open_file_object.writerow(["PassengerId", "Survived"])
open_file_object.writerows(zip(ids, pred))
predictions_file.close()
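# An equivalent alternative using pandas instead of the csv module:
#pd.DataFrame({'PassengerId': ids, 'Survived': pred}).to_csv('myfirstsubmission.csv', index=False)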

print('done.')




