[Data Analysis] Library Data 05: Clustering Reader Types

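      Grouping readers by the total number of books they borrow gives a rough picture of how actively they borrow. But what other factors affect students' borrowing? Different types of readers also want different things from books, so here we take the group with the higher borrowing counts and classify its readers a second time, looking for the factors behind their reading.
      First, import the existing reader information: each reader's sex, department, and the corresponding borrowed-book records.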
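
As a rough illustration of the input this step expects, here is a minimal sketch on a tiny in-memory table. Only the column names (`read_sex`, `book_id`, `read_unit`) come from the script below; the row values are made up, and the assumption is that `read_unit` holds a college name and a major separated by a single space.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the script.
sample = pd.DataFrame({
    'read_sex': ['M', 'F', 'F'],
    'book_id': [1001, 1002, 1001],
    'read_unit': ['Engineering Software', 'Arts History', 'Engineering Software'],
})

# Split "college major" on the first space into two columns.
parts = sample['read_unit'].str.split(' ', n=1, expand=True)
sample['dapartment'] = parts[0]  # college (spelling follows the script)
sample['major'] = parts[1]       # major

print(sample[['read_sex', 'book_id', 'dapartment', 'major']])
```

Using `expand=True` returns the pieces as separate columns, which avoids indexing into lists with `.str[0]` and `.str[1]`; either approach should give the same college and major values on data of this shape.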

# -*-coding:utf-8-*-

import pandas as pd
import numpy as np

pf = pd.read_csv('new_data.csv', encoding='gbk')
# print(pf.head())

unit = pf['read_unit']
unit = unit.str.split(' ')  # college and major are separated by a space
dapartment = unit.str[0]  # college
print(dapartment)
major = unit.str[1]  # major
print(major)

new_table = pf[['read_sex', 'book_id']]
# print(new_table)
new_table.insert(2, 'dapartment', dapartment)  # insert the college column
# new_table.insert(3, 'major', major)  # insert the major column
print(new_table)

sex = new_table['read_sex']
book = new_table['book_id']
dapartment = new_table['dapartment']
print(sex, book, dapartment)  # print the three columns

"""
    Algorithm: build integer labels.
    Each distinct value gets an integer label, numbered from 1
    in order of first appearance.
"""


def add_label(s):
    labels = []
    seen = {}  # value -> label assigned at its first appearance

    for value in s:
        if value not in seen:
            seen[value] = len(seen) + 1  # next unused label
        labels.append(seen[value])
    return labels


sex_list = sex.tolist()  # convert: Series -> list
book_list = book.tolist()
dapartment_list = dapartment.tolist()
# print(sex_list, book_list, dapartment_list)

sex_list = add_label(sex_list)  # turn each column into integer labels
book_list = add_label(book_list)
dapartment_list = add_label(dapartment_list)
# print(sex_list, book_list, dapartment_list)

new_cluster = []
new_cluster.append(sex_list)  # sex
new_cluster.append(book_list)  # book
new_cluster.append(dapartment_list)  # college
m = np.array(new_cluster).T  # list -> matrix, one row per record
print('m:', m)
print('----------------------------------------------')

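import sys

from sklearn.cluster import KMeans

# Print arrays in full instead of abbreviating them with "..."
np.set_printoptions(threshold=sys.maxsize)

kmeans = KMeans(n_clusters=5)  # split the records into 5 clusters
print(kmeans)
predict = kmeans.fit_predict(m)  # one cluster id per row of m
print(predict)
print(len(predict))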
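
The printed `predict` array is just one cluster id per record, which is hard to read on its own. One possible way to get a feel for the result is to count how many records land in each cluster and how the clusters break down by sex. The sketch below uses short made-up lists as stand-ins for `predict` and the `read_sex` column; the values are hypothetical and not taken from the real data.

```python
from collections import Counter

# Hypothetical stand-ins for the script's outputs (not real results).
predict = [0, 2, 2, 1, 0, 2]          # cluster id per record
sex = ['M', 'F', 'F', 'M', 'M', 'F']  # read_sex per record

# Number of records in each cluster.
sizes = Counter(predict)

# Number of records for each (cluster, sex) pair.
by_cluster = Counter(zip(predict, sex))

for cluster in sorted(sizes):
    print('cluster', cluster, 'size', sizes[cluster])
for (cluster, s), count in sorted(by_cluster.items()):
    print('cluster', cluster, 'sex', s, 'count', count)
```

With the real script, `predict.tolist()` and `sex.tolist()` could be passed in place of the two sample lists, and the same counting could be repeated for the college column.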
