K-Means Clustering in Python

The Idea of K-Means Clustering

  1. Randomly select K points as the initial cluster centers.
  2. Assign every remaining point to the class of the nearest of the K centers.
  3. Recompute each center as the mean of all points assigned to it.
  4. Repeat steps 2 and 3 until the cluster centers no longer change.
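The steps above can be sketched as a minimal NumPy implementation (the function name and parameters here are illustrative, not from the original post; it assumes no cluster becomes empty during iteration):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-Means sketch: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K random points as the initial centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

On two well-separated groups, e.g. `[[0,0],[0,1]]` and `[[10,10],[10,11]]`, the two pairs end up with different labels.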

Python implementation:

import numpy as np
from sklearn.cluster import KMeans

def loadData(filePath):
    retData = []
    retCityName = []
    with open(filePath, 'r') as fr:
        for line in fr:
            items = line.strip().split(',')  # preprocess: split the city name from the expense figures
            retCityName.append(items[0])
            retData.append([float(items[i]) for i in range(1, len(items))])
    return retData, retCityName

if __name__ == '__main__':
    data, cityName = loadData('city.txt')
    km = KMeans(n_clusters=3)                        # cluster into three classes
    label = km.fit_predict(data)                     # assign a cluster label to each city
    expenses = np.sum(km.cluster_centers_, axis=1)   # sum each center row (the expense level of each cluster)
    print(expenses)
    CityCluster = [[], [], []]
    for i in range(len(cityName)):
        CityCluster[label[i]].append(cityName[i])    # put each city under its cluster label
    for i in range(len(CityCluster)):
        print('Expenses:%.2f' % expenses[i])         # print each cluster's expense level
        print(CityCluster[i])
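The choice of `n_clusters=3` above is fixed by hand. One common way to sanity-check K is the silhouette coefficient, the same metric used in the DBSCAN section below: fit several values of K and keep the one with the highest score. A small sketch with made-up values standing in for the `city.txt` data (the numbers are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical 2-D data with three visually obvious groups
data = np.array([[1.0, 2.0], [1.2, 1.8],
                 [8.0, 8.0], [8.2, 7.9],
                 [4.0, 4.1], [4.1, 3.9]])

# Try several cluster counts and score each clustering
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    print(k, round(silhouette_score(data, labels), 3))
```

With this toy data, K = 3 attains the highest silhouette score, matching the three groups.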

DBSCAN Density Clustering

Feature: the number of clusters does not need to be specified in advance, so the final number of classes is not fixed.
DBSCAN divides data points into three classes:
1. Core points: points with more than MinPts points within radius Eps.
2. Border points: points with fewer than MinPts points within radius Eps, but which fall inside the neighborhood of a core point.
3. Noise points: points belonging to neither of the above classes.
DBSCAN algorithm flow:
1. Mark every point as a core, border, or noise point (for each point, compute the set of points within its Eps neighborhood, e.g. Eps = 3; a point whose neighborhood contains more than MinPts = 3 points is a core point; a remaining point is a border point if it lies inside some core point's neighborhood, otherwise it is a noise point).
2. Delete the noise points.
3. Add an edge between every pair of core points within distance Eps of each other.
4. Each connected group of core points forms a cluster.
5. Assign each border point to the cluster of an associated core point (one whose radius it falls within).
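The five steps above can be sketched directly in NumPy (a minimal illustration, not the sklearn implementation used below; following sklearn's convention, a point's neighborhood here counts the point itself, and the core test uses "at least MinPts" rather than "more than MinPts"):

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch; returns labels, with -1 meaning noise."""
    n = len(points)
    # Step 1: find each point's Eps-neighborhood and mark core points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)  # Step 2: noise points simply stay labelled -1
    # Steps 3-4: core points within Eps are connected; flood-fill each component
    cluster = 0
    for i in range(n):
        if core[i] and labels[i] == -1:
            stack = [i]
            while stack:
                j = stack.pop()
                if labels[j] == -1:
                    labels[j] = cluster      # Step 5: border points reached here
                    if core[j]:              # join the cluster of a nearby core point
                        stack.extend(neighbors[j])
            cluster += 1
    return labels
```

For example, five points packed within distance 1 of each other plus one distant outlier, with eps = 1 and min_pts = 3, yield one cluster and one noise point.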
Python implementation:

import numpy as np
import sklearn.cluster as skc
from sklearn import metrics
import matplotlib.pyplot as plt

mac2id = dict()
onlinetimes = []
with open('TestData.txt', encoding='utf-8') as f:
    for line in f:
        mac = line.split(',')[2]              # MAC address
        onlinetime = int(line.split(',')[6])  # online duration
        starttime = int(line.split(',')[4].split(' ')[1].split(':')[0])  # login hour
        if mac not in mac2id:
            mac2id[mac] = len(onlinetimes)
            onlinetimes.append((starttime, onlinetime))
        else:
            onlinetimes[mac2id[mac]] = (starttime, onlinetime)  # keep the latest record
real_X = np.array(onlinetimes).reshape((-1, 2))  # one (starttime, onlinetime) row per MAC

x = real_X[:, 0:1]                               # cluster on the login hour
db = skc.DBSCAN(eps=0.01, min_samples=20).fit(x)
labels = db.labels_
print('Labels:')
print(labels)
ratio = len(labels[labels[:] == -1]) / len(labels)
print('noise ratio:', format(ratio, '.2%'))
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print('Silhouette coefficient: %0.3f' % metrics.silhouette_score(x, labels))
for i in range(n_clusters_):
    print('cluster', i, ':')
    print(list(x[labels == i].flatten()))
plt.hist(x, 24)

x = np.log(1 + real_X[:, 1:])                    # log-transform the online duration
db = skc.DBSCAN(eps=0.14, min_samples=10).fit(x)
labels = db.labels_
print('Labels:')
print(labels)
ratio = len(labels[labels[:] == -1]) / len(labels)
print('noise ratio:', format(ratio, '.2%'))
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print('Silhouette coefficient: %0.3f' % metrics.silhouette_score(x, labels))
for i in range(n_clusters_):
    print('cluster', i, ':')
    count = len(x[labels == i])
    mean = np.mean(real_X[labels == i][:, 1])
    std = np.std(real_X[labels == i][:, 1])
    print('\t number of samples:', count)
    print('\t mean of samples:', format(mean, '.1f'))
    print('\t std of samples:', format(std, '.1f'))

