2020-07-21

我们使用均值漂移,继续聚类和非监督学习的话题,这次将其用于我们的泰坦尼克数据集。

这里有一些随机度,所以你的结果可能并不相同,然而你可以重新运行程序来获取相似结果,如果你没有得到相似结果的话。

我们打算通过均值漂移聚类来看一看泰坦尼克数据集。我们感兴趣的是,是否均值漂移能够自动将乘客分离为分组。如果能,检查它创建的分组就很有趣了。第一个明显的兴趣点就是,所发现分组的幸存率,但是,我们也会深入这些分组的属性,来观察我们是否能够理解,均值漂移为什么决定了特定的分组。

首先,我们使用已经看过的代码:

import numpy as np
from sklearn.cluster import MeanShift, KMeans
from sklearn import preprocessing, cross_validation
import pandas as pd
import matplotlib.pyplot as plt


'''
Pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival Survival (0 = No; 1 = Yes)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare (British pound)
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat Lifeboat
body Body Identification Number
home.dest Home/Destination
'''


# https://pythonprogramming.net/static/downloads/machine-learning-data/titanic.xls
df = pd.read_excel('titanic.xls')

original_df = pd.DataFrame.copy(df)
df.drop(['body','name'], 1, inplace=True)
df.fillna(0,inplace=True)

def handle_non_numerical_data(df):
    
    # handling non-numerical data: must convert.
    columns = df.columns.values

    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]

        #print(column,df[column].dtype)
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            
            column_contents = df[column].values.tolist()
            #finding just the uniques
            unique_elements = set(column_contents)
            # great, found them. 
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    # creating dict that contains new
                    # id per unique string
                    text_digit_vals[unique] = x
                    x+=1
            # now we map the new "id" vlaue
            # to replace the string. 
            df[column] = list(map(convert_to_int,df[column]))

    return df

df = handle_non_numerical_data(df)
df.drop(['ticket','home.dest'], 1, inplace=True)

X = np.array(df.drop(['survived'], 1).astype(float))
X = preprocessing.scale(X)
y = np.array(df['survived'])

clf = MeanShift()
clf.fit(X)

除了两个例外,一个是original_df = pd.DataFrame.copy(df),在我们将csv文件读取到df对象之后。另一个是从sklearn.cluster导入MeanShift,并且用其作为我们的聚类器。我们生成了副本,以便之后引用原始非数值形式的数据。

既然我们创建了拟合,我们可以从clf对象获取一些属性。

labels = clf.labels_
cluster_centers = clf.cluster_centers_

下面,我们打算向我们的原始数据帧添加新的一项。

original_df['cluster_group']=np.nan

现在,我们可以迭代标签,并向空列添加新的标签。

for i in range(len(X)):
    original_df['cluster_group'].iloc[i] = labels[i]

现在我们可以检查每个分组的幸存率:

n_clusters_ = len(np.unique(labels))
survival_rates = {}
for i in range(n_clusters_):
    temp_df = original_df[ (original_df['cluster_group']==float(i)) ]
    #print(temp_df.head())

    survival_cluster = temp_df[  (temp_df['survived'] == 1) ]

    survival_rate = len(survival_cluster) / len(temp_df)
    #print(i,survival_rate)
    survival_rates[i] = survival_rate
    
print(survival_rates)

如果我们执行它,我们会得到:

{0: 0.3796583850931677, 1: 0.9090909090909091, 2: 0.1}

同样,你可能获得更多分组。我这里获得了三个,但是我在这个数据集上获得过六个分组。现在,我们看到分组 0 的幸存率是 38%,分组 1 是 91%,分组 2 是 10%。这就有些奇怪了,因为我们知道船上有三个真实的“乘客分类”。我想知道是不是 0 就是二等舱,1 就是头等舱, 2 是三等舱。船上的舱是,3 等舱在最底下,头等舱在最上面,底部首先淹没,然后顶部是救生船的地方。我可以深入看一看:

print(original_df[ (original_df['cluster_group']==1) ])

我们获取cluster_group为 1 的original_df

打印出来:

     pclass  survived                                               name  \
17        1         1    Baxter, Mrs. James (Helene DeLaudeniere Chaput)   
49        1         1                 Cardeza, Mr. Thomas Drake Martinez   
50        1         1  Cardeza, Mrs. James Warburton Martinez (Charlo...   
66        1         1                        Chaudanson, Miss. Victorine   
97        1         1  Douglas, Mrs. Frederick Charles (Mary Helene B...   
116       1         1                Fortune, Mrs. Mark (Mary McDougald)   
183       1         1                             Lesurer, Mr. Gustave J   
251       1         1              Ryerson, Miss. Susan Parker "Suzette"   
252       1         0                         Ryerson, Mr. Arthur Larned   
253       1         1    Ryerson, Mrs. Arthur Larned (Emily Maria Borie)   
302       1         1                                   Ward, Miss. Anna   

        sex   age  sibsp  parch    ticket      fare            cabin embarked  \
17   female  50.0      0      1  PC 17558  247.5208          B58 B60        C   
49     male  36.0      0      1  PC 17755  512.3292      B51 B53 B55        C   
50   female  58.0      0      1  PC 17755  512.3292      B51 B53 B55        C   
66   female  36.0      0      0  PC 17608  262.3750              B61        C   
97   female  27.0      1      1  PC 17558  247.5208          B58 B60        C   
116  female  60.0      1      4     19950  263.0000      C23 C25 C27        S   
183    male  35.0      0      0  PC 17755  512.3292             B101        C   
251  female  21.0      2      2  PC 17608  262.3750  B57 B59 B63 B66        C   
252    male  61.0      1      3  PC 17608  262.3750  B57 B59 B63 B66        C   
253  female  48.0      1      3  PC 17608  262.3750  B57 B59 B63 B66        C   
302  female  35.0      0      0  PC 17755  512.3292              NaN        C   

    boat  body                                       home.dest  cluster_group  
17     6   NaN                                    Montreal, PQ            1.0  
49     3   NaN  Austria-Hungary / Germantown, Philadelphia, PA            1.0  
50     3   NaN                    Germantown, Philadelphia, PA            1.0  
66     4   NaN                                             NaN            1.0  
97     6   NaN                                    Montreal, PQ            1.0  
116   10   NaN                                    Winnipeg, MB            1.0  
183    3   NaN                                             NaN            1.0  
251    4   NaN                 Haverford, PA / Cooperstown, NY            1.0  
252  NaN   NaN                 Haverford, PA / Cooperstown, NY            1.0  
253    4   NaN                 Haverford, PA / Cooperstown, NY            1.0  
302    3   NaN                                             NaN            1.0 

很确定了,整个分组就是头等舱。也就是说,这里实际上只有 11 个人。让我们看看分组 0,它看起来有些不同。这一次,我们使用 Pandas 的.describe()方法。

print(original_df[ (original_df['cluster_group']==0) ].describe())
            pclass     survived          age        sibsp        parch  \
count  1288.000000  1288.000000  1027.000000  1288.000000  1288.000000   
mean      2.300466     0.379658    29.668614     0.496118     0.332298   
std       0.833785     0.485490    14.395610     1.047430     0.686068   
min       1.000000     0.000000     0.166700     0.000000     0.000000   
25%       2.000000     0.000000    21.000000     0.000000     0.000000   
50%       3.000000     0.000000    28.000000     0.000000     0.000000   
75%       3.000000     1.000000    38.000000     1.000000     0.000000   
max       3.000000     1.000000    80.000000     8.000000     4.000000   

              fare        body  cluster_group  
count  1287.000000  119.000000         1288.0  
mean     30.510172  159.571429            0.0  
std      41.511032   97.302914            0.0  
min       0.000000    1.000000            0.0  
25%       7.895800   71.000000            0.0  
50%      14.108300  155.000000            0.0  
75%      30.070800  255.500000            0.0  
max     263.000000  328.000000            0.0  

这里有 1287 个人,我们可以看到平均等级是二等舱,但是这里从头等到三等都有。

让我们检查最后一个分组,2,它的预期是全都是三等舱:

print(original_df[ (original_df['cluster_group']==2) ].describe())
       pclass   survived        age      sibsp      parch       fare  \
count    10.0  10.000000   8.000000  10.000000  10.000000  10.000000   
mean      3.0   0.100000  39.875000   0.800000   6.000000  42.703750   
std       0.0   0.316228   1.552648   0.421637   1.632993  15.590194   
min       3.0   0.000000  38.000000   0.000000   5.000000  29.125000   
25%       3.0   0.000000  39.000000   1.000000   5.000000  31.303125   
50%       3.0   0.000000  39.500000   1.000000   5.000000  35.537500   
75%       3.0   0.000000  40.250000   1.000000   6.000000  46.900000   
max       3.0   1.000000  43.000000   1.000000   9.000000  69.550000   

             body  cluster_group  
count    2.000000           10.0  
mean   234.500000            2.0  
std    130.814755            0.0  
min    142.000000            2.0  
25%    188.250000            2.0  
50%    234.500000            2.0  
75%    280.750000            2.0  
max    327.000000            2.0  

很确定了,我们是对的,这个分组全是三等舱,所以有最坏的幸存率。

足够有趣,在查看所有分组的时候,分组 2 的票价范围的确是最低的,从 29 到 69 磅。

在我们查看簇 0 的时候,票价最高为 263 磅。这是最大的组,幸存率为 38%。

当我们回顾簇 1 时,它全是头等舱,我们看到这里的票价范围是 247 ~ 512 磅,均值为 350。尽管簇 0 有一些头等舱的乘客,这个分组是最精英的分组。

出于好奇,分组 0 的头等舱的生存率,与整体生存率相比如何呢?

>>> cluster_0 = (original_df[ (original_df['cluster_group']==0) ])
>>> cluster_0_fc = (cluster_0[ (cluster_0['pclass']==1) ])
>>> print(cluster_0_fc.describe())
       pclass    survived         age       sibsp       parch        fare  \
count   312.0  312.000000  273.000000  312.000000  312.000000  312.000000   
mean      1.0    0.608974   39.027167    0.432692    0.326923   78.232519   
std       0.0    0.488764   14.589592    0.606997    0.653100   60.300654   
min       1.0    0.000000    0.916700    0.000000    0.000000    0.000000   
25%       1.0    0.000000   28.000000    0.000000    0.000000   30.500000   
50%       1.0    1.000000   39.000000    0.000000    0.000000   58.689600   
75%       1.0    1.000000   49.000000    1.000000    0.000000   91.079200   
max       1.0    1.000000   80.000000    3.000000    4.000000  263.000000   

             body  cluster_group  
count   35.000000          312.0  
mean   162.828571            0.0  
std     82.652172            0.0  
min     16.000000            0.0  
25%    109.500000            0.0  
50%    166.000000            0.0  
75%    233.000000            0.0  
max    307.000000            0.0  
>>> 

很确定了,它们的幸存率更高,约为 61%,但是仍然低于精英分组(根据票价和幸存率)的 91%。花费一些时间来深入挖掘,看看你是否能发现一些东西。

你可能感兴趣的:(2020-07-21)