Titanic Data Analysis (Partial)

Titanic

Source: https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy
The "hello world" of machine learning.

How a Data Scientist Beat the Odds

A Data Science Framework

  1. Define the Problem: don't put the cart before the horse. Problems before requirements, requirements before solutions, solutions before design, and design before technology.
  2. Gather the Data: collect the data.
  3. Prepare Data for Consumption: data wrangling and cleaning.
  4. Perform Exploratory Analysis: preliminary analysis; garbage in, garbage out. Use descriptive and graphical statistics.
  5. Model Data: choose the right model.
  6. Validate and Implement Data Model: determine whether the model correctly fits the data.
  7. Optimize and Strategize: iterate back through the process to improve it.

For the Titanic problem, these steps become:

  1. Define the problem: predict whether a passenger survived.

  2. Gather the data: already provided.

  3. Prepare data for consumption:

    1. import libraries

    2. Load Data Modelling Libraries

    3. Meet & Greet Data

      Import the data and take a first look with info() and sample() (a consolidated sketch follows at the end of this item).

      1. More predictor variables do not make a better model; the right variables do. Not every feature is a useful feature.
      2. PassengerId and Ticket are random unique identifiers, so they are useless as features.
      3. Pclass is an ordinal number representing socio-economic class.
      4. The Name column can be mined: the given name hints at gender, the surname at family members aboard, and the title at social class.
      5. Sex and Embarked are nominal categories that can be converted to numeric codes for calculation.
      6. Age and Fare are continuous numbers.
      7. SibSp (siblings/spouses aboard) and Parch (parents/children aboard) can be combined via feature engineering into a family-size feature, or kept as separate features.
      8. Cabin could be used to infer a passenger's location on the ship, but it has too many missing values, so the feature is dropped.
      # Deep-copy the raw data; this copy is used only for analysis
      data1 = data_raw.copy(deep = True)
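      A minimal sketch putting this sub-section together: load the data, take a first look with info() and sample(), and make the analysis copy (the file path is an assumption, not from this excerpt):

      import pandas as pd

      data_raw = pd.read_csv('../input/train.csv')   #assumed Kaggle path for the training data
      data_raw.info()                                 #column dtypes and non-null counts
      print(data_raw.sample(10))                      #a random sample of 10 rows for a quick look
      data1 = data_raw.copy(deep = True)              #deep copy used only for analysis, as above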
      
    4. The 4 C's of data cleaning

      1. Correcting: review the data and make sure there are no aberrant or unacceptable values (e.g. age = 800); be very cautious about changing original values (a quick sanity-check sketch follows the inspection code below).
      2. Completing: handle missing values. Many algorithms cannot tolerate them, although decision trees can. Two options: drop the record, or fill with a reasonable value. Dropping is not recommended unless a large share is missing; common fills are the mean, the median, or the mean plus a randomized standard deviation (a sketch of this last option follows the encoding code below).
      3. Creating: use existing features to generate new ones; here a Title feature is added.
      4. Converting: format the data, e.g. encode categorical fields as numbers.
      # Count of missing values in each column of train
      train.isnull().sum()
      # Summary statistics of train (min, max, std, etc.)
      train.describe(include = 'all')
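      A minimal sketch of the "Correcting" sanity check described above; the plausibility thresholds are illustrative assumptions:

      #flag rows with values outside a plausible range; inspect before changing anything
      bad_age = train[(train['Age'] < 0) | (train['Age'] > 100)]
      bad_fare = train[train['Fare'] < 0]
      print('Implausible ages:\n', bad_age[['PassengerId', 'Age']])
      print('Negative fares:\n', bad_fare[['PassengerId', 'Fare']])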
      
      #Cleaning: in the source kernel these fills run inside a loop over both the train and validation copies (for dataset in data_cleaner); the column drop applies only to data1
          #complete missing age with median
          dataset['Age'].fillna(dataset['Age'].median(), inplace = True)
          #complete embarked with mode
          dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace = True)
          #complete missing fare with median
          dataset['Fare'].fillna(dataset['Fare'].median(), inplace = True)
          #drop three columns that will not be used
          drop_column = ['PassengerId','Cabin', 'Ticket']
          data1.drop(drop_column, axis=1, inplace = True)
      #Feature engineering (Creating)
      	#Discrete variables
          dataset['FamilySize'] = dataset ['SibSp'] + dataset['Parch'] + 1
      
          dataset['IsAlone'] = 1 #initialize to yes/1 is alone
          dataset['IsAlone'].loc[dataset['FamilySize'] > 1] = 0 # now update to no/0 if family size is greater than 1
          #quick and dirty code split title from name: http://www.pythonforbeginners.com/dictionary/python-split
          dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
          
          #Continuous variable bins; qcut vs cut: https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut
          #Fare Bins/Buckets using qcut or frequency bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
          dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)
          #Age Bins/Buckets using cut or value bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
          dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)
          
      #code categorical data
      	from sklearn.preprocessing import OneHotEncoder, LabelEncoder
      	label = LabelEncoder()
      	dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
          dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
          dataset['Title_Code'] = label.fit_transform(dataset['Title'])
          dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
          dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])
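      A sketch of the "mean plus randomized standard deviation" fill mentioned under Completing above; it is not what the kernel ultimately uses (Age is filled with the median), shown only for illustration:

      import numpy as np

      #fill missing ages with random draws within one standard deviation of the mean
      age_mean = dataset['Age'].mean()
      age_std = dataset['Age'].std()
      n_missing = dataset['Age'].isnull().sum()
      dataset.loc[dataset['Age'].isnull(), 'Age'] = np.random.uniform(age_mean - age_std, age_mean + age_std, size = n_missing)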
      
    5. Split into training and test sets (sklearn's train_test_split function; the default split is 75/25)

    from sklearn import model_selection

    Target = ['Survived']   #the y variable (defined earlier in the source kernel)
    #define x variables for original features aka feature selection
    data1_x = ['Sex','Pclass', 'Embarked', 'Title','SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] #pretty name/values for charts
    data1_x_calc = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code','SibSp', 'Parch', 'Age', 'Fare'] #coded for algorithm calculation
    #define x variables for original w/bin features to remove continuous variables
    data1_x_bin = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
    #define x and y variables for dummy features original
    data1_dummy = pd.get_dummies(data1[data1_x])
    data1_x_dummy = data1_dummy.columns.tolist()
    data1_xy_dummy = Target + data1_x_dummy
    
    #split train and test data with function defaults
    #random_state -> seed or control random number generator: https://www.quora.com/What-is-seed-in-random-number-generation
    train1_x, test1_x, train1_y, test1_y = model_selection.train_test_split(data1[data1_x_calc], data1[Target], random_state = 0)
    train1_x_bin, test1_x_bin, train1_y_bin, test1_y_bin = model_selection.train_test_split(data1[data1_x_bin], data1[Target] , random_state = 0)
    train1_x_dummy, test1_x_dummy, train1_y_dummy, test1_y_dummy = model_selection.train_test_split(data1_dummy[data1_x_dummy], data1[Target], random_state = 0)
    
  4. Perform exploratory analysis: find the predictive features and how they correlate with the target.

#Discrete Variable Correlation by Survival using
#group by aka pivot table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
for x in data1_x:
    if data1[x].dtype != 'float64' :
        print('Survival Correlation by:', x)
        print(data1[[x, Target[0]]].groupby(x, as_index=False).mean())
        print('-'*10, '\n')
        
        
#IMPORTANT: Intentionally plotted different ways for learning purposes only. 

#optional plotting w/pandas: https://pandas.pydata.org/pandas-docs/stable/visualization.html

#we will use matplotlib.pyplot: https://matplotlib.org/api/pyplot_api.html

#to organize our graphics will use figure: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure
#subplot: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplot.html#matplotlib.pyplot.subplot and subplotS: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html?highlight=matplotlib%20pyplot%20subplots#matplotlib.pyplot.subplots
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=[16,12])
plt.subplot(234)
plt.hist(x = [data1[data1['Survived']==1]['Fare'], data1[data1['Survived']==0]['Fare']], 
         stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Fare Histogram by Survival')
plt.xlabel('Fare ($)')
plt.ylabel('# of Passengers')   #histogram
plt.legend()
#plot distributions of age of passengers who survived or did not survive
a = sns.FacetGrid( data1, hue = 'Survived', aspect=4 )
a.map(sns.kdeplot, 'Age', shade= True )
a.set(xlim=(0 , data1['Age'].max()))  #distribution (KDE) plot
a.add_legend()

#histogram comparison of sex, class, and age by survival
h = sns.FacetGrid(data1, row = 'Sex', col = 'Pclass', hue = 'Survived')
h.map(plt.hist, 'Age', alpha = .75)  #comparative histograms
h.add_legend()

#correlation heatmap of dataset
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)
    
    _ = sns.heatmap(
        df.corr(), 
        cmap = colormap,
        square=True, 
        cbar_kws={'shrink':.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor='white',
        annot_kws={'fontsize':12 }
    )
    
    plt.title('Pearson Correlation of Features', y=1.05, size=15)   #correlation heatmap

correlation_heatmap(data1)
  5. Model the data

This step draws on three kinds of knowledge: mathematics, computer science, and business management. The emphasis is on WHY you do something, not on applying it off the shelf.

Machine learning algorithms fall roughly into four groups: classification, regression, clustering, and dimensionality reduction. This problem uses classification methods:

  • Ensemble Methods
  • Generalized Linear Models (GLM)
  • Naive Bayes
  • Nearest Neighbors
  • Support Vector Machines (SVM)
  • Decision Trees
  • Discriminant Analysis

In addition, the beginner must learn the No Free Lunch Theorem (NFLT) of machine learning. In short, NFLT states that there is no super algorithm that works best in all situations for all datasets. So the best approach is to try multiple MLAs, tune them, and compare them for your specific scenario.

It is recommended to start with Trees, Bagging, Random Forests, and Boosting. They are basically different implementations of a decision tree, which is the easiest concept to learn and understand (a minimal baseline sketch follows).
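As a minimal sketch of that advice: a single decision tree as a baseline, with a small grid search for tuning. The parameter grid and the use of data1_x_bin here are illustrative assumptions, not the kernel's own tuning section.

from sklearn import model_selection, tree

#baseline: one decision tree, scored with cross-validation
dtree = tree.DecisionTreeClassifier(random_state = 0)
base_scores = model_selection.cross_val_score(dtree, data1[data1_x_bin], data1[Target[0]], cv = 5)
print('Baseline decision tree accuracy: %.3f' % base_scores.mean())

#quick tuning with a small, illustrative parameter grid
param_grid = {'max_depth': [2, 4, 6, 8, None], 'criterion': ['gini', 'entropy']}
grid = model_selection.GridSearchCV(dtree, param_grid = param_grid, cv = 5)
grid.fit(data1[data1_x_bin], data1[Target[0]])
print('Best parameters:', grid.best_params_)
print('Best CV accuracy: %.3f' % grid.best_score_)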

#imports needed for the MLA list below (scikit-learn modules and xgboost)
from sklearn import ensemble, gaussian_process, linear_model, naive_bayes, neighbors, svm, tree, discriminant_analysis
from xgboost import XGBClassifier

MLA = [
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),
    
    #GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    
    #Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    
    #SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),
    
    #Trees    
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),
    
    #Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),

    
    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    XGBClassifier()    
    ]
cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0 ) # run model 10x with 60/30 split intentionally leaving out 10%

#create table to compare MLA metrics
MLA_columns = ['MLA Name', 'MLA Parameters','MLA Train Accuracy Mean', 'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD' ,'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)

#create table to compare MLA predictions
MLA_predict = data1[Target]

#index through MLA and save performance to table
row_index = 0
for alg in MLA:

    #set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
    
    #score model with cross validation: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate
    cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv  = cv_split, return_train_score = True)

    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()
    MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()   
    #if this is a non-bias random sample, then +/-3 standard deviations (std) from the mean, should statistically capture 99.7% of the subsets
    MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std()*3   #let's know the worst that can happen!
    

    #save MLA predictions - see section 6 for usage
    alg.fit(data1[data1_x_bin], data1[Target])
    MLA_predict[MLA_name] = alg.predict(data1[data1_x_bin])
    
    row_index+=1

    
#print and sort table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
MLA_compare.sort_values(by = ['MLA Test Accuracy Mean'], ascending = False, inplace = True)
MLA_compare
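
A quick way to visualize the comparison table (in the source kernel this is done with a seaborn barplot; sketched here under that assumption):

import seaborn as sns
import matplotlib.pyplot as plt

#barplot of mean test accuracy per algorithm, sorted best-first by the table above
sns.barplot(x = 'MLA Test Accuracy Mean', y = 'MLA Name', data = MLA_compare, color = 'm')
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Accuracy Score (%)')
plt.ylabel('Algorithm')
plt.show()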
