Titanic: Predicting Survival Through Data Analysis Can Be Surprisingly Simple



Titanic: Data Analysis

Data: 12 fields; the training set has 891 records; the test set has 418 records

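PassengerId => passenger ID

Survived => whether the passenger survived (0 = no, 1 = yes; training set only)

Pclass => passenger class (1st/2nd/3rd class)

Name => passenger name

Sex => sex

Age => age

SibSp => number of siblings/spouses aboard

Parch => number of parents/children aboard

Ticket => ticket information

Fare => ticket fare

Cabin => cabin

Embarked => port of embarkation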

 

Goal: predict whether each of the 418 people in the test data was ultimately rescued

Result: 97.37% accuracy

 

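Such a high accuracy is surprising. After reading the original author's Python code, though, I found that it does not use any sophisticated algorithm, and yet the accuracy really is that high. It is impressive that such a complex problem can be handled with such a simple approach.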

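Main idea: use mainly three fields: sex, passenger class, and fare.

The fare is split into four brackets: 0~10, 10~20, 20~30, and 30~max. Each bracket includes its lower bound and excludes its upper bound, and the script treats every fare of 40 or more as 39 so that it falls into the last bracket.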
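The bracket rule can be sketched as a small function. This is a minimal Python 3 illustration, not the script's own code: `fare_bracket` is a hypothetical helper name, and the bracket size of 10 and ceiling of 40 are taken from the script shown later in this post.

```python
def fare_bracket(fare, bracket_size=10, ceiling=40):
    """Return the 0-based fare bracket index (hypothetical helper)."""
    # Fares at or above the ceiling are capped just below it,
    # so they all land in the last bracket.
    capped = min(fare, ceiling - 1)
    return int(capped // bracket_size)


for fare in (7.25, 10.0, 29.99, 30.0, 512.33):
    print(fare, fare_bracket(fare))
```

A fare of exactly 10 belongs to the second bracket (index 1), and any very large fare ends up in the last bracket (index 3).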

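Then the survival rate is computed for each combination of sex, passenger class, and fare bracket.

The result is the table below, which consists of two 3*4 matrices. The first one is for women: its first row, first column is the survival rate of female first-class passengers in the 0~10 fare bracket, and the other entries follow the same pattern (rows are passenger classes 1 to 3, columns are the four fare brackets). The second 3*4 matrix is for men.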




survival_table

[[[ 0.          0.          0.83333333  0.97727273]
  [ 0.          0.91428571  0.9         1.        ]
  [ 0.59375     0.58139535  0.33333333  0.125     ]]

 [[ 0.          0.          0.4         0.38372093]
  [ 0.          0.15873016  0.16        0.21428571]
  [ 0.11153846  0.23684211  0.125       0.24      ]]]


 


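The table above shows that women's survival rate was, overall, much higher than men's. This reflects the chivalrous code of conduct in Europe at the time, under which children, women, and the elderly were given priority in everything, and that is why the table looks the way it does.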
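Each entry of such a table is simply the mean of the Survived column within one (sex, class, fare bracket) group. The following Python 3 sketch shows one way to compute it; it uses a handful of made-up records rather than the Kaggle data, and it accumulates counts per cell instead of using the boolean masks of the script shown later, so treat it as an illustration only.

```python
import numpy as np

# Toy records, NOT the Kaggle data: (survived, pclass, sex, fare)
records = [
    (1, 1, "female", 35.0),
    (1, 1, "female", 80.0),
    (0, 1, "male", 45.0),
    (1, 1, "male", 52.0),
    (1, 3, "female", 7.9),
    (0, 3, "female", 8.1),
    (0, 3, "male", 7.3),
    (0, 3, "male", 9.0),
]

n_classes, n_brackets, bracket_size, ceiling = 3, 4, 10, 40

# Accumulate survivors and head counts per (sex, class, fare bracket) cell.
survivors = np.zeros((2, n_classes, n_brackets))
totals = np.zeros((2, n_classes, n_brackets))
for survived, pclass, sex, fare in records:
    s = 0 if sex == "female" else 1                   # 0 = women, 1 = men
    b = int(min(fare, ceiling - 1) // bracket_size)   # fare bracket index
    survivors[s, pclass - 1, b] += survived
    totals[s, pclass - 1, b] += 1

# Survival rate per cell; cells without passengers stay at 0.
rates = np.divide(survivors, totals,
                  out=np.zeros_like(survivors), where=totals > 0)
print(rates)
```

The resulting array has the same layout as the table above: index 0 is sex, index 1 is passenger class, and index 2 is the fare bracket.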


As the table below shows, every position whose survival rate is at least 0.5 is marked 1, and every position below 0.5 is marked 0.


survival_table13

[[[ 0.  0.  1.  1.]
  [ 0.  1.  1.  1.]
  [ 1.  1.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]]


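With this table we can make a prediction for each of the 418 records in the test set. The script scores those predictions against the labels in gendermodel.csv, and the resulting accuracy is 97.37%.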
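The thresholding and lookup steps can be sketched in a few lines of Python 3. The survival rates are copied from the survival_table printed in this post; `decision_table` and `predict` are hypothetical names introduced here for illustration, and the three sample passengers are made up.

```python
import numpy as np

# Survival rates copied from the survival_table printed in this post.
survival_table = np.array([
    [[0.0, 0.0, 0.83333333, 0.97727273],         # women, 1st class
     [0.0, 0.91428571, 0.9, 1.0],                # women, 2nd class
     [0.59375, 0.58139535, 0.33333333, 0.125]],  # women, 3rd class
    [[0.0, 0.0, 0.4, 0.38372093],                # men, 1st class
     [0.0, 0.15873016, 0.16, 0.21428571],        # men, 2nd class
     [0.11153846, 0.23684211, 0.125, 0.24]],     # men, 3rd class
])

# 1 where the group survival rate is at least 0.5, otherwise 0.
decision_table = (survival_table >= 0.5).astype(int)


def predict(sex, pclass, fare, bracket_size=10, ceiling=40):
    """Look up the 0/1 prediction for one passenger (hypothetical helper)."""
    s = 0 if sex == "female" else 1
    b = int(min(fare, ceiling - 1) // bracket_size)
    return int(decision_table[s, pclass - 1, b])


print(predict("female", 1, 80.0))  # 1
print(predict("female", 3, 25.0))  # 0
print(predict("male", 1, 80.0))    # 0
```

Note that with these rates every cell for men is below 0.5, so the rule predicts that no man survives, while for women the prediction depends on class and fare.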


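The data needed by the code can be downloaded from: https://www.kaggle.com/c/titanic/data

The script is written for Python 2 and an older NumPy release (it uses print statements, xrange, and np.float).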


# coding=utf-8
""" Now that the user can read in a file this creates a model which uses the price, class and gender
Author : AstroDave
Date : 18th September 2012
Revised : 28 March 2014

"""


import csv as csv
import numpy as np

csv_file_object = csv.reader(open('train.csv', 'rb'))       # Load in the csv file
header = csv_file_object.next()                             # Skip the first line as it is a header
data=[]                                                     # Create a variable to hold the data

for row in csv_file_object:                 # Skip through each row in the csv file
    data.append(row)                        # adding each row to the data variable
data = np.array(data)                       # Then convert from a list to an array

# In order to analyse the price column I need to bin up that data
# here are my binning parameters, the problem we face is some of the fares are very large
# So we can either have a lot of bins with nothing in them or we can just lose some
# information by just considering that anything over 39 is simply in the last bin.
# So we add a ceiling
fare_ceiling = 40
# then modify the data in the Fare column to = 39, if it is greater or equal to the ceiling
data[ data[0::,9].astype(np.float) >= fare_ceiling, 9 ] = fare_ceiling - 1.0

fare_bracket_size = 10
number_of_price_brackets = fare_ceiling / fare_bracket_size
number_of_classes = 3                             # I know there were 1st, 2nd and 3rd classes on board.
number_of_classes = len(np.unique(data[0::,2]))   # But it's better practice to calculate this from the Pclass directly:
                                                  # just take the length of an array of UNIQUE values in column index 2


# This reference matrix will show the proportion of survivors as a sorted table of
# gender, class and ticket fare.
# First initialize it with all zeros
survival_table = np.zeros([2,number_of_classes,number_of_price_brackets],float)
# print 'survival_table \n',survival_table
# I can now find the stats of all the women and men on board
for i in xrange(number_of_classes):
    for j in xrange(number_of_price_brackets):

        women_only_stats = data[ (data[0::,4] == "female") \
                                 & (data[0::,2].astype(np.float) == i+1) \
                                 & (data[0:,9].astype(np.float) >= j*fare_bracket_size) \
                                 & (data[0:,9].astype(np.float) < (j+1)*fare_bracket_size), 1]

        men_only_stats = data[ (data[0::,4] != "female") \
                                 & (data[0::,2].astype(np.float) == i+1) \
                                 & (data[0:,9].astype(np.float) >= j*fare_bracket_size) \
                                 & (data[0:,9].astype(np.float) < (j+1)*fare_bracket_size), 1]

        # Mean survival rate of each sex for this class and fare bracket
        survival_table[0,i,j] = np.mean(women_only_stats.astype(np.float))  # Female stats
        survival_table[1,i,j] = np.mean(men_only_stats.astype(np.float))    # Male stats

# Since in python if it tries to find the mean of an array with nothing in it
# (such that the denominator is 0), then it returns nan, we can convert these to 0
# by just saying where does the array not equal the array, and set these to 0.
# print 'survival_table1 \n',survival_table
survival_table[ survival_table != survival_table ] = 0.
print 'survival_table12 \n',survival_table
# Now I have my proportion of survivors, simply round them such that if <0.5
# I predict they don't survive, and if >= 0.5 they do
survival_table[ survival_table < 0.5 ] = 0
survival_table[ survival_table >= 0.5 ] = 1
print 'survival_table13 \n',survival_table
# Now I have my indicator I can read in the test file and write out
# the prediction looked up in the table: survived (1) or did not survive (0)
# First read in test
test_file = open('test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()

# Also open a new file so I can write to it.
predictions_file = open("genderclassmodel.csv", "wb")
predictions_file_object = csv.writer(predictions_file)
predictions_file_object.writerow(["PassengerId", "Survived"])

# First thing to do is bin up the price file
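sum=0                       # Predictions that match gendermodel.csv (note: shadows the built-in sum)
count=1                     # Line number in gendermodel.csv; line 1 is its header
import linecache            # Used to read single lines of gendermodel.csv by number
linecache.clearcache()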
for row in test_file_object:
    count+=1
    for j in xrange(number_of_price_brackets):
        # If there is no fare then place the price of the ticket according to class
        try:
            row[8] = float(row[8])    # No fare recorded will come up as a string so
                                      # try to make it a float
        except:                       # If fails then just bin the fare according to the class
            bin_fare = 3 - float(row[1])
            break                     # Break from the loop and move to the next row
        if row[8] >= fare_ceiling:    # Otherwise test whether it is at or above
                                      # the fare ceiling we set earlier (as in training)
            bin_fare = number_of_price_brackets - 1
            break                     # And then break to the next row

        if row[8] >= j*fare_bracket_size\
            and row[8] < (j+1)*fare_bracket_size:     # If passed these tests then loop through
                                                      # each bin until you find the right one
                                                      # append it to the bin_fare
                                                      # and move to the next loop
            bin_fare = j
            break
        # Now I have the binned fare, passenger class, and whether female or male, we can
        # just cross ref their details with our survival table
    if row[3] == 'female':
        if int(survival_table[ 0, float(row[1]) - 1, bin_fare ])==int( linecache.getline('gendermodel.csv', count).split(',')[1]):
            sum+=1
        predictions_file_object.writerow([row[0], "%d" % int(survival_table[ 0, float(row[1]) - 1, bin_fare ])])
    else:
        if int(survival_table[ 1, float(row[1]) - 1, bin_fare ])==int( linecache.getline('gendermodel.csv', count).split(',')[1]):
            sum+=1
        predictions_file_object.writerow([row[0], "%d" % int(survival_table[ 1, float(row[1]) - 1, bin_fare])])

# Close out the files

proportion_survived = sum /float(count-1)

print 'people number:%d accuracy number: %d'%(count,sum)
print 'Forecast accuracy  rate of people who survived is %s' % proportion_survived
# print 'Proportion of men who survived is %s' % proportion_men_survived
test_file.close()
predictions_file.close()
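The script replaces the means of empty groups by relying on the fact that NaN is not equal to itself. As a small Python 3 sketch on a made-up array, the same cells can be selected with np.isnan, which states the intent more directly:

```python
import numpy as np

# Made-up table with two "empty group" cells represented as NaN.
table = np.array([[np.nan, 0.8], [0.25, np.nan]])

# NaN is the only floating-point value that is not equal to itself,
# so `table != table` marks exactly the NaN cells.
mask_self_compare = table != table
mask_isnan = np.isnan(table)
print(np.array_equal(mask_self_compare, mask_isnan))  # True

# Replace the NaN cells with 0, as the script does.
table[mask_isnan] = 0.0
print(table)
```

Either mask gives the same result; the self-comparison form is the one used in the script above.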

Output of the script:

survival_table12 
[[[ 0.          0.          0.83333333  0.97727273]
  [ 0.          0.91428571  0.9         1.        ]
  [ 0.59375     0.58139535  0.33333333  0.125     ]]

 [[ 0.          0.          0.4         0.38372093]
  [ 0.          0.15873016  0.16        0.21428571]
  [ 0.11153846  0.23684211  0.125       0.24      ]]]
survival_table13 
[[[ 0.  0.  1.  1.]
  [ 0.  1.  1.  1.]
  [ 1.  1.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]]
people number:419 accuracy number: 407
Forecast accuracy  rate of people who survived is 0.973684210526
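The reported rate follows directly from the two counts in the output. Because count starts at 1 (the header line of gendermodel.csv), the printed "people number" of 419 is one more than the 418 test rows, and the script divides by count-1. A quick check in Python 3:

```python
matches = 407          # "accuracy number" printed by the script
printed_count = 419    # "people number" printed by the script
rows = printed_count - 1   # the script divides by count-1, i.e. 418 test rows

rate = matches / rows
print(round(rate, 12))     # 0.973684210526
```

So 407 of the 418 predictions agree with the labels in gendermodel.csv, which is the 97.37% quoted at the top of this post.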


