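This task applies the k-means algorithm to cluster the frog call MFCC dataset. All of the data is labeled and falls into four classes. Because the data is high-dimensional, it has to be reduced in dimension before it can be visualized; t-SNE is used for the reduction here, and matplotlib then plots the clustering results. The dataset link and the complete source code are given at the end of the article.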
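The KMeans algorithm clusters data by trying to separate the samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares. The algorithm requires the number of clusters to be specified. It scales well to large numbers of samples and has been used across many application areas in many different fields.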
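The k-means algorithm divides a set of $N$ samples $X$ into $K$ disjoint clusters $C$, each described by the mean $\mu_j$ of the samples in the cluster. These means are commonly called the cluster "centroids". Note that they are, in general, not points taken from $X$, although they live in the same space.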
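The k-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion:

$$\sum_{i=1}^{N} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2$$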
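As a minimal sketch of this criterion, the inertia can be computed by hand with NumPy and compared with the value scikit-learn reports in `inertia_`. The toy data, the seed, and the variable names below are assumptions made for illustration; they are not part of the frog dataset workflow.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): two well-separated 2-D groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Squared distance from every sample to every centroid, shape (N, K).
diff = X[:, None, :] - model.cluster_centers_[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)

# Inertia: sum over samples of the squared distance to the nearest centroid.
manual_inertia = sq_dist.min(axis=1).sum()

print(manual_inertia, model.inertia_)
```

The two printed numbers should agree up to floating-point error, since each sample ends up assigned to its nearest centroid.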
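Inertia can be regarded as a measure of how internally coherent the clusters are, and it has various drawbacks.
The result of k-means also depends on how the centroids are initialized, which is what the k-means++ initialization scheme addresses (the init='k-means++' parameter). It initializes the centroids to be (generally) distant from each other, which tends to give better results than random initialization.
As usual, the data is loaded with pandas: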
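import pandas as pd
from sklearn import cluster

# Load the frog call MFCC dataset.
frog_data = pd.read_csv("../datas/Frogs_MFCCs.csv")
# First feature subset: every fourth MFCC column, starting from the first.
first_set = frog_data[['MFCCs_ 1', 'MFCCs_ 5', 'MFCCs_ 9', 'MFCCs_13', 'MFCCs_17', 'MFCCs_21']]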
Then fit the model:
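# n_jobs is omitted: KMeans no longer accepts it in scikit-learn 1.0 and later.
model = cluster.KMeans(n_clusters=4, max_iter=100, init="k-means++")
# data_set is the feature DataFrame to cluster, e.g. first_set.
model.fit(data_set)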
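To get a feel for what init="k-means++" changes, the sketch below fits the same model with both initialization schemes and prints the final inertia and iteration count. The synthetic blobs from `make_blobs`, the seed, and `n_init=1` are assumptions chosen so that each run reflects a single initialization; the outcome depends on the data and the seed, so this is an experiment to try, not a guaranteed ranking.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): 4 Gaussian blobs in 6 dimensions.
X, _ = make_blobs(n_samples=400, centers=4, n_features=6, random_state=0)

results = {}
for init in ("k-means++", "random"):
    # n_init=1 so each run reflects one initialization, not the best of several.
    km = KMeans(n_clusters=4, init=init, n_init=1, max_iter=100, random_state=0)
    km.fit(X)
    results[init] = (km.inertia_, km.n_iter_)

for init, (inertia, n_iter) in results.items():
    print(f"{init:>10}: inertia={inertia:.2f}, iterations={n_iter}")
```

Repeating the loop over several values of `random_state` gives a fairer picture than a single run.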
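# Append each sample's cluster label as a new column ('聚类类别' means "cluster label").
r = pd.concat([data_set, pd.Series(model.labels_, index=data_set.index)], axis=1)
r.columns = list(data_set.columns) + [u'聚类类别']
print(r)
# Save the labeled samples to an Excel file; output_file is the target path.
r.to_excel(output_file)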
from sklearn import metrics

# Predicted cluster labels; t_labels holds the true class labels.
p_labels = list(model.labels_)
# Append the Fowlkes-Mallows score and the adjusted Rand index to the score file.
with open(score_file, "a") as sf:
    sf.write("By k-means, the f-m_score of " + set_name + " is: " + str(metrics.fowlkes_mallows_score(t_labels, p_labels))+"\n")
    sf.write("By k-means, the rand_score of " + set_name + " is: " + str(metrics.adjusted_rand_score(t_labels, p_labels))+"\n")
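Both scores compare groupings, not label values, which matters here because the cluster IDs returned by k-means are arbitrary and need not match the integer codes chosen for the true classes. A small sketch with made-up label lists (all values below are illustrative assumptions) shows this:

```python
from sklearn import metrics

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
# Same grouping as true_labels, but with the cluster IDs permuted.
permuted = [2, 2, 2, 0, 0, 0, 1, 1, 1]
# A clustering that puts one sample in the wrong group.
imperfect = [2, 2, 0, 0, 0, 0, 1, 1, 1]

fm_perfect = metrics.fowlkes_mallows_score(true_labels, permuted)
ari_perfect = metrics.adjusted_rand_score(true_labels, permuted)
fm_imperfect = metrics.fowlkes_mallows_score(true_labels, imperfect)
ari_imperfect = metrics.adjusted_rand_score(true_labels, imperfect)

print(fm_perfect, ari_perfect)      # both 1.0: a relabeling does not matter
print(fm_imperfect, ari_imperfect)  # both below 1.0
```

This is why the true labels can be encoded as 0 to 3 in any order without affecting the reported scores.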
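from sklearn.manifold import TSNE

# Reduce the features to 2 dimensions (the TSNE default) for plotting.
t_sne = TSNE()
t_sne.fit(data_set)
# Wrap the embedding in a DataFrame that shares the original index.
t_sne = pd.DataFrame(t_sne.embedding_, index=data_set.index)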
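As a side note, fitting and reading `embedding_` can be done in one call with `fit_transform`. The sketch below uses random stand-in features (an assumption, not the frog data) and a fixed `random_state` so that repeated runs produce the same layout.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in features (assumption): 100 samples, 6 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# fit_transform fits the model and returns the 2-D embedding in one call;
# the returned array is the same as the embedding_ attribute.
tsne = TSNE(n_components=2, random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)  # (100, 2)
```

Note that t-SNE only embeds the data it was fitted on, and its default perplexity requires more than 30 samples.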
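import matplotlib.pyplot as plt

# SimHei lets matplotlib render Chinese text; keep the minus sign displayable.
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# Plot each cluster's 2-D points with its own marker style.
dd = t_sne[r[u'聚类类别'] == 0]
plt.plot(dd[0], dd[1], 'r.')
dd = t_sne[r[u'聚类类别'] == 1]
plt.plot(dd[0], dd[1], 'go')
dd = t_sne[r[u'聚类类别'] == 2]
plt.plot(dd[0], dd[1], 'b*')
dd = t_sne[r[u'聚类类别'] == 3]
plt.plot(dd[0], dd[1], 'o')
# Save the figure, then clear it so the next plot starts empty.
plt.savefig(png_file)
plt.clf()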
# Second feature subset, clustered with the k_means function from the full listing.
second_set = frog_data[['MFCCs_ 3', 'MFCCs_ 7', 'MFCCs_11', 'MFCCs_15', 'MFCCs_19']]
k_means(second_set, "../output/kMeansSet_2.xlsx", "../output/kMeansSet_2.png", tLabel, scoreFile, "Set_2")
Complete source code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn import metrics
from sklearn.manifold import TSNE


def k_means(data_set, output_file, png_file, t_labels, score_file, set_name):
    # Cluster into 4 groups (n_jobs is omitted: scikit-learn 1.0+ no longer accepts it).
    model = cluster.KMeans(n_clusters=4, max_iter=100, init="k-means++")
    model.fit(data_set)
    p_labels = list(model.labels_)

    # Append the cluster label ('聚类类别') to each sample and save to Excel.
    r = pd.concat([data_set, pd.Series(model.labels_, index=data_set.index)], axis=1)
    r.columns = list(data_set.columns) + [u'聚类类别']
    print(r)
    r.to_excel(output_file)

    # Append the evaluation scores against the true labels to the score file.
    with open(score_file, "a") as sf:
        sf.write("By k-means, the f-m_score of " + set_name + " is: " + str(metrics.fowlkes_mallows_score(t_labels, p_labels))+"\n")
        sf.write("By k-means, the rand_score of " + set_name + " is: " + str(metrics.adjusted_rand_score(t_labels, p_labels))+"\n")

    # Reduce the features to 2 dimensions with t-SNE for plotting.
    t_sne = TSNE()
    t_sne.fit(data_set)
    t_sne = pd.DataFrame(t_sne.embedding_, index=data_set.index)

    # SimHei lets matplotlib render Chinese text; keep the minus sign displayable.
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False

    # Plot each cluster's 2-D points with its own marker style.
    dd = t_sne[r[u'聚类类别'] == 0]
    plt.plot(dd[0], dd[1], 'r.')
    dd = t_sne[r[u'聚类类别'] == 1]
    plt.plot(dd[0], dd[1], 'go')
    dd = t_sne[r[u'聚类类别'] == 2]
    plt.plot(dd[0], dd[1], 'b*')
    dd = t_sne[r[u'聚类类别'] == 3]
    plt.plot(dd[0], dd[1], 'o')
    plt.savefig(png_file)
    plt.clf()


frog_data = pd.read_csv("../datas/Frogs_MFCCs.csv")

# Encode the frog families as integer labels 0-3.
tLabel = []
for family in frog_data['Family']:
    if family == "Leptodactylidae":
        tLabel.append(0)
    elif family == "Dendrobatidae":
        tLabel.append(1)
    elif family == "Hylidae":
        tLabel.append(2)
    else:
        tLabel.append(3)

scoreFile = "../output/scoreOfClustering.txt"
first_set = frog_data[['MFCCs_ 1', 'MFCCs_ 5', 'MFCCs_ 9', 'MFCCs_13', 'MFCCs_17', 'MFCCs_21']]
k_means(first_set, "../output/kMeansSet_1.xlsx", "../output/kMeansSet_1.png", tLabel, scoreFile, "Set_1")
https://download.csdn.net/download/swy_swy_swy/12409033