Implementing the K-Means Algorithm in Python and Matlab

Reference: http://blog.topspeedsnail.com/archives/10349

Python version: 3.6.2

Matlab version:

I. K-Means in Python

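This example uses the Titanic passenger list. The records are clustered on the fields other than survived (k=2: survived/died), and the resulting clusters are then compared with the survived field.
Dataset download: http://blog.topspeedsnail.com/wp-content/uploads/2016/11/titanic.xls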
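
The script in this post relies on scikit-learn's KMeans for the clustering itself. As a rough illustration of what such a step does, here is a minimal NumPy sketch of the standard Lloyd iteration (assign each point to its nearest center, then move each center to the mean of its points). The function name `kmeans_sketch`, its parameters, and the toy points are hypothetical and chosen only for illustration; this is not scikit-learn's implementation, which among other things uses k-means++ initialization by default.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Toy Lloyd iteration: returns (labels, centers) for a 2-D float array."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (a center with no points is left where it is)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated groups of three points each
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans_sketch(points, k=2)
print(labels)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random starting centers, which is why cluster numbers cannot be compared directly with a ground-truth column.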
# Adapted from: http://blog.topspeedsnail.com/archives/10349
import numpy as np
from sklearn.cluster import KMeans
from sklearn import preprocessing
import pandas as pd

"""
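Dataset: titanic.xls (list of Titanic passengers, both victims and survivors)

***Fields***
pclass: passenger class (1 = first, 2 = second, 3 = third), a proxy for socio-economic status
survived: whether the passenger survived
name: name
sex: sex
age: age
sibsp: number of siblings/spouses aboard
parch: number of parents/children aboard
ticket: ticket number
fare: ticket fare
cabin: cabin
embarked
boat
body: body identification number
home.dest
******
Goal: run k-means on the fields other than survived to form two groups (survived/died), then compare the groups with the survived field to see how well the clustering did.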
"""

# Load the data
df = pd.read_excel("titanic.xls")
#print (df.shape)
#print(df.head())
#print(df.tail())
"""
    pclass  survived                                            name     sex  \
0       1         1                    Allen, Miss. Elisabeth Walton  female
1       1         1                   Allison, Master. Hudson Trevor    male
2       1         0                     Allison, Miss. Helen Loraine  female
3       1         0             Allison, Mr. Hudson Joshua Creighton    male
4       1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female

       age  sibsp  parch  ticket      fare    cabin embarked boat   body  \
0  29.0000      0      0   24160  211.3375       B5        S    2    NaN
1   0.9167      1      2  113781  151.5500  C22 C26        S   11    NaN
2   2.0000      1      2  113781  151.5500  C22 C26        S  NaN    NaN
3  30.0000      1      2  113781  151.5500  C22 C26        S  NaN  135.0
4  25.0000      1      2  113781  151.5500  C22 C26        S  NaN    NaN

    home.dest
0                     St Louis, MO
1  Montreal, PQ / Chesterville, ON
2  Montreal, PQ / Chesterville, ON
3  Montreal, PQ / Chesterville, ON
4  Montreal, PQ / Chesterville, ON
"""

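# Drop columns that are not useful for clustering
df.drop(['body','name','ticket'],axis=1,inplace=True)

df = df.infer_objects()  # infer_objects returns a new DataFrame, so keep the result
df.fillna(0,inplace=True)  # Replace NaN with 0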

# Map non-numeric values to integer codes, e.g. female: 1, male: 0
df_map = {}  # Stores the mapping used for each column
cols = df.columns.values
for col in cols:
    if df[col].dtype != np.int64 and df[col].dtype != np.float64:
        temp = {}
        code = 0
        # Set iteration order is arbitrary, so the codes can differ between runs
        for ele in set(df[col].values.tolist()):
            if ele not in temp:
                temp[ele] = code
                code += 1

        df_map[df[col].name] = temp
        df[col] = list(map(lambda val:temp[val],df[col]))

# Feature matrix: every remaining column except survived, standardized per column
x = np.array(df.drop(['survived'],axis=1).astype(float))
x = preprocessing.scale(x)

clf = KMeans(n_clusters=2)
clf.fit(x)

y = np.array(df['survived'])

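# Count the rows whose cluster ID equals the survived value
# (cluster IDs are arbitrary, so the match may come out inverted)
correct = 0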
for i in range(len(x)):
    predict_data = np.array(x[i].astype(float))
    predict_data = predict_data.reshape(-1,len(predict_data))
    predict = clf.predict(predict_data)
    #print(predict[0],y[i])
    if predict[0] == y[i]:
        correct +=1

print(correct*1.0/len(x))
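
The string-to-integer mapping in the script iterates over a set, so the codes it assigns are not guaranteed to be the same from one run to the next. Below is a minimal sketch of a repeatable alternative that numbers values in order of first appearance. The helper name `encode_in_order` and the sample values are hypothetical and not part of the original script.

```python
def encode_in_order(values):
    """Assign integer codes to values in order of first appearance.

    Returns (codes, mapping), where codes is a list of ints aligned with
    values and mapping is the value -> code dictionary that was built.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            # The next unused code is simply the current size of the mapping
            mapping[value] = len(mapping)
    codes = [mapping[value] for value in values]
    return codes, mapping

codes, mapping = encode_in_order(["female", "male", "female", "male", "male"])
print(codes)    # [0, 1, 0, 1, 1]
print(mapping)  # {'female': 0, 'male': 1}
```

In the script, the inner loop could then be replaced by something like `codes, temp = encode_in_order(df[col].tolist())` followed by `df[col] = codes`. pandas' `pd.factorize` codes values by first appearance in a similar way.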

Output of k_means.py on my machine:

"D:\Program files\python3.6.2\python.exe" D:/sunfl/sunflower/study/机器学习/聚类算法/K-Means/k_means.py
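0.2987012987012987  # K-means numbers its clusters arbitrarily (survived: 0, died: 1, or survived: 1, died: 0), so this figure can differ widely between runs; when the numbering is inverted, take 1 minus the value (here 1 - 0.2987 ≈ 0.7013).

Process finished with exit code 0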
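
The "1 minus" adjustment noted in the output can be folded into the scoring step itself. This is a minimal sketch, assuming exactly two clusters and 0/1 ground-truth labels; the helper name `two_cluster_accuracy` and the sample labels are hypothetical and not part of the original script.

```python
def two_cluster_accuracy(labels, truth):
    """Fraction of matching labels, ignoring which cluster is numbered 0.

    Only valid for two clusters compared against 0/1 ground truth: if the
    cluster numbering is inverted, the raw match rate is 1 - accuracy.
    """
    matches = sum(1 for label, actual in zip(labels, truth) if label == actual)
    acc = matches / len(truth)
    return max(acc, 1.0 - acc)

cluster_ids = [1, 1, 0, 0, 1]   # as a clustering step might return them
survived = [0, 0, 1, 1, 1]      # ground truth
print(two_cluster_accuracy(cluster_ids, survived))  # 0.8
```

With the script's variables this would be called as `two_cluster_accuracy(clf.predict(x), y)`, which also scores the whole matrix in one `predict` call instead of row by row.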
