Implementing k-means clustering in Python without scientific computing packages: simple k-means clustering of a bag-of-words model in Python

The input dataset looks like this:

    {"666": ["abc", "xyz"],
     "888": ["xxxo", "xxxo"],
     "007": ["abc"]}

We first build a bag-of-words model with the following function:

    import pprint


    def associate_terms_with_user(unique_term_set, all_users_terms_dict):
        associated_value_return_dict = {}
        # freeze the set into a list once so every user sees the same term order
        unique_terms = list(unique_term_set)
        # consider each user in turn
        for user_id in all_users_terms_dict:
            # what terms *could* this user have possibly used: start with all '0'
            this_user_zero_vector = ['0'] * len(unique_terms)
            # what terms *did* this user use
            terms_belong_to_this_user = all_users_terms_dict.get(user_id)
            # walk through the global list of all possible terms
            for global_term_element_index, term in enumerate(unique_terms):
                # flag the term if this user used it at least once
                # (this records presence, not the number of uses)
                if term in terms_belong_to_this_user:
                    this_user_zero_vector[global_term_element_index] = '1'
            associated_value_return_dict.update({user_id: this_user_zero_vector})
        pprint.pprint(associated_value_return_dict)
        return associated_value_return_dict

The program's output looks like this:

    {'007': ['0', '0', '1'],
     '666': ['0', '1', '1'],
     '888': ['1', '0', '0']}
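Since clustering works on numbers, the '0' and '1' strings would need converting at some point. A minimal sketch of a more compact construction that yields integers directly is shown here. The name `build_binary_vectors` is hypothetical, and sorting the vocabulary is an assumption made so that column order is reproducible; it means the columns will not necessarily line up with the output shown above, which depends on set iteration order.

```python
def build_binary_vectors(all_users_terms_dict):
    """Return (vocabulary, {user_id: [0 or 1, ...]}) for a user -> terms dict."""
    # Sorted so that column order is stable between runs (an assumption;
    # the original function iterates over a set, whose order is arbitrary).
    vocabulary = sorted(
        {term for terms in all_users_terms_dict.values() for term in terms}
    )
    vectors = {}
    for user_id, terms in all_users_terms_dict.items():
        used = set(terms)  # duplicates collapse: presence only, not counts
        vectors[user_id] = [1 if term in used else 0 for term in vocabulary]
    return vocabulary, vectors


users = {"666": ["abc", "xyz"], "888": ["xxxo", "xxxo"], "007": ["abc"]}
vocabulary, vectors = build_binary_vectors(users)
print(vocabulary)  # ['abc', 'xxxo', 'xyz']
print(vectors)     # {'666': [1, 0, 1], '888': [0, 1, 0], '007': [1, 0, 0]}
```

The vocabulary is derived from the data itself here, so no separate `unique_term_set` argument is needed.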

How can we implement a simple function that clusters these vectors based on their similarity to one another? I imagine using k-means, possibly via scikit-learn.
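For illustration, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) that needs no scientific computing packages. Everything in it is an assumption rather than an established answer: the function names, the choice of squared Euclidean distance, the random initialisation from input points, and the 20 restarts with a fixed seed. scikit-learn's `sklearn.cluster.KMeans` would be the usual library route instead.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def run_kmeans_once(points, k, rng, max_iterations=100):
    """One k-means run from a random start; returns (assignment, total_error)."""
    # Start from k distinct input points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so this run has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    error = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, error


def kmeans(vectors, k, restarts=20, seed=0):
    """Cluster a {label: vector} dict; returns {label: cluster_index}."""
    rng = random.Random(seed)
    labels = list(vectors)
    # float() also accepts the '0'/'1' strings used in the output above.
    points = [[float(v) for v in vectors[label]] for label in labels]
    # Keep the restart with the lowest total squared error.
    best = min(
        (run_kmeans_once(points, k, rng) for _ in range(restarts)),
        key=lambda result: result[1],
    )
    return dict(zip(labels, best[0]))


user_vectors = {
    "007": ["0", "0", "1"],
    "666": ["0", "1", "1"],
    "888": ["1", "0", "0"],
}
clusters = kmeans(user_vectors, k=2)
print(clusters)  # cluster numbers are arbitrary; only the grouping matters
```

The restarts matter even on data this small. If a single run happens to start from the vectors of 007 and 666, it converges to grouping 007 with 888 (total squared error 1.0) instead of the lower-error grouping of 007 with 666 (0.5).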

I have never done this before and don't know how to approach it. I'm very new to machine learning and don't even know where to start.

In the end, 666 and 007 would probably be clustered together, while 888 would probably end up alone in its own cluster, wouldn't they?
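That intuition can be checked by comparing the vectors pairwise. The sketch below assumes the vectors have been converted to integers and uses Hamming distance, which for 0/1 vectors equals squared Euclidean distance; `hamming_distance` is a hypothetical helper.

```python
from itertools import combinations

user_vectors = {
    "007": [0, 0, 1],
    "666": [0, 1, 1],
    "888": [1, 0, 0],
}


def hamming_distance(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Distance for every unordered pair of users.
distances = {
    (u, v): hamming_distance(user_vectors[u], user_vectors[v])
    for u, v in combinations(sorted(user_vectors), 2)
}
for pair, distance in distances.items():
    print(pair, distance)
# ('007', '666') 1
# ('007', '888') 2
# ('666', '888') 3

closest_pair = min(distances, key=distances.get)
print(closest_pair)  # ('007', '666')
```

007 and 666 are the closest pair and 888 is farthest from both, which supports the expected grouping. Distances alone do not guarantee that a given k-means run finds it, because the result also depends on the initial centroids.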

The full code is available here.
