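Recommender system
key: user
value: play count
For each record, check whether the user is already in the dictionary. If so, update the value; otherwise add the new user to the dictionary.
The data contains many inactive users with very few plays, which can drag down the average quality of the results, so the users are sorted by play count and filtered.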
import pandas as pd

# data_home is assumed to be defined earlier as the directory holding the dataset files
output_dict = {}
with open(data_home + 'train_triplets.txt') as f:
    for line in f:
        # Each line is: user <TAB> song <TAB> play_count
        fields = line.split('\t')
        user = fields[0]
        play_count = int(fields[2])
        # Add this row's plays to the user's running total (0 for a new user)
        output_dict[user] = output_dict.get(user, 0) + play_count

# Convert to a DataFrame and sort by total play count, highest first
output_list = [{'user': k, 'play_count': v} for k, v in output_dict.items()]
play_count_df = pd.DataFrame(output_list)
play_count_df = play_count_df.sort_values(by='play_count', ascending=False)
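When the data fits in memory, the same per-user totals can also be sketched with a pandas groupby instead of a hand-written dictionary. The sample rows below are made up for illustration, and `user_totals` is a hypothetical name.

```python
import io
import pandas as pd

# Made-up sample in the same tab-separated user/song/play_count layout.
sample = io.StringIO(
    "u1\ts1\t3\n"
    "u2\ts1\t1\n"
    "u1\ts2\t5\n"
    "u3\ts3\t2\n"
)

triplets = pd.read_csv(sample, sep="\t", header=None,
                       names=["user", "song", "play_count"])

# Sum the plays per user, then sort so the most active users come first.
user_totals = (
    triplets.groupby("user", as_index=False)["play_count"]
    .sum()
    .sort_values(by="play_count", ascending=False)
    .reset_index(drop=True)
)
print(user_totals)
```

Grouping on the `song` column instead of `user` would give the per-song totals in the same way.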
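The total play count of each song is computed in the same way as above (the code below refers to the result as song_count_df).
Users: sort by play count and keep only part of the users as the experimental data.
Criterion: (total plays of the top n users) / (total plays of all songs) = the share of all user activity contributed by the top n users.
Choosing a threshold for this percentage is enough to filter out the inactive users.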
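# Total plays over all songs
total_play_count = sum(song_count_df.play_count)
# Percentage of all plays contributed by the 100,000 most active users
print((float(play_count_df.head(n=100000).play_count.sum()) / total_play_count) * 100)
play_count_subset = play_count_df.head(n=100000)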
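Rather than fixing n in advance, the cut-off can be derived from a target percentage. The following is a minimal sketch with made-up totals; `users_needed` is a hypothetical helper, and the 80% threshold is only an example value.

```python
import pandas as pd

# Hypothetical per-user totals, already sorted in descending order.
play_count_df = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "play_count": [50, 30, 10, 6, 4],
})

total = play_count_df["play_count"].sum()

# Cumulative share (in percent) of all plays covered by the top-n users.
cumulative_share = play_count_df["play_count"].cumsum() * 100 / total


def users_needed(threshold_percent):
    """Smallest n whose top-n users cover at least threshold_percent of plays.

    Assumes 0 < threshold_percent <= 100, so the threshold is always reached.
    """
    reached = cumulative_share >= threshold_percent
    return int(reached.values.argmax()) + 1


n = users_needed(80)
play_count_subset = play_count_df.head(n=n)
print(n, list(play_count_subset["user"]))
```

The same cumulative-share calculation applies to songs when choosing how many songs to keep.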
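Selecting the data: keep 100,000 users and 30,000 songs.
# song_count_subset is assumed to hold the top 30,000 songs, built like play_count_subset
user_subset = list(play_count_subset.user)
song_subset = list(song_count_subset.song)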
Filter the original file:
triplet_dataset = pd.read_csv(filepath_or_buffer=data_home + 'train_triplets.txt', sep='\t',
                              header=None, names=['user', 'song', 'play_count'])
# Keep only the rows that belong to the selected users
triplet_dataset_sub = triplet_dataset[triplet_dataset.user.isin(user_subset)]
del triplet_dataset  # free memory
# Then keep only the rows for the selected songs
triplet_dataset_sub_song = triplet_dataset_sub[triplet_dataset_sub.song.isin(song_subset)]
del triplet_dataset_sub
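If loading the whole file at once is too memory-hungry, one possible alternative is to filter it chunk by chunk with the `chunksize` option of `pd.read_csv`. This is only a sketch on made-up rows; the chunk size of 2 is chosen to exercise the loop and would be much larger in practice.

```python
import io
import pandas as pd

# Made-up sample standing in for train_triplets.txt.
sample = io.StringIO("u1\ts1\t3\nu2\ts1\t1\nu1\ts2\t5\nu3\ts3\t2\n")
user_subset = ["u1", "u3"]
song_subset = ["s1", "s3"]

kept = []
for chunk in pd.read_csv(sample, sep="\t", header=None,
                         names=["user", "song", "play_count"], chunksize=2):
    # Apply both filters at once so no intermediate copy is kept around.
    mask = chunk["user"].isin(user_subset) & chunk["song"].isin(song_subset)
    kept.append(chunk[mask])

filtered = pd.concat(kept, ignore_index=True)
print(filtered)
```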
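The detailed song information is stored in a database. First connect to the database, extract the information for the selected songs, and convert it to a CSV file.
import sqlite3

conn = sqlite3.connect(data_home + 'track_metadata.db')
cur = conn.cursor()
# List the tables in the database
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
cur.fetchall()
# Load the songs table and keep only the selected songs
track_metadata_df = pd.read_sql(con=conn, sql='select * from songs')
track_metadata_df_sub = track_metadata_df[track_metadata_df.song_id.isin(song_subset)]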
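The database step can be tried without the real file by using an in-memory SQLite database. In this sketch the `songs` table has only three made-up columns and rows; the real table has more columns.

```python
import sqlite3
import pandas as pd

# In-memory database with a made-up songs table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (song_id TEXT, title TEXT, artist_name TEXT)")
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?)",
    [("s1", "Title A", "Artist X"),
     ("s2", "Title B", "Artist Y"),
     ("s3", "Title C", "Artist Z")],
)
conn.commit()

# List the tables, as the query against sqlite_master does.
tables = [row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]

# Load the table into a DataFrame and keep only the selected songs.
songs_df = pd.read_sql_query("SELECT * FROM songs", conn)
song_subset = ["s1", "s3"]
songs_sub = songs_df[songs_df["song_id"].isin(song_subset)]
conn.close()

print(tables)
print(songs_sub)
```

Writing the result out with `songs_sub.to_csv(...)` would produce the CSV file mentioned in the text.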
With the metadata in hand, merge the metadata table with the play-count table on the song ID and drop the columns that are not needed, which gives the final dataset.
# Drop unused columns and duplicate songs from the metadata
track_metadata_df_sub = track_metadata_df_sub.drop(columns=['track_id', 'artist_mbid'])
track_metadata_df_sub = track_metadata_df_sub.drop_duplicates(['song_id'])
# Attach the metadata to every (user, song, play_count) row
triplet_dataset_sub_song_merged = pd.merge(triplet_dataset_sub_song, track_metadata_df_sub,
                                           how='left', left_on='song', right_on='song_id')
triplet_dataset_sub_song_merged.rename(columns={'play_count': 'listen_count'}, inplace=True)
# Remove the columns that are not used later
triplet_dataset_sub_song_merged = triplet_dataset_sub_song_merged.drop(columns=[
    'song_id', 'artist_id', 'duration', 'artist_familiarity',
    'artist_hotttnesss', 'track_7digitalid', 'shs_perf', 'shs_work'])
triplet_dataset_sub_song_merged.head(n=10)
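A small made-up example shows why the metadata is de-duplicated before the merge: with a left join, a duplicated `song_id` on the right side would otherwise duplicate the play rows. The column values here are illustrative only.

```python
import pandas as pd

plays = pd.DataFrame({
    "user": ["u1", "u2", "u1"],
    "song": ["s1", "s1", "s2"],
    "play_count": [3, 1, 5],
})
# Metadata with a duplicated song_id (two tracks of the same song).
meta = pd.DataFrame({
    "song_id": ["s1", "s1", "s2"],
    "title": ["A", "A (dup)", "B"],
    "track_id": ["t1", "t2", "t3"],
})

# Keep one metadata row per song so the left join cannot multiply play rows.
meta = meta.drop_duplicates(["song_id"])

merged = pd.merge(plays, meta, how="left", left_on="song", right_on="song_id")
merged = (merged.rename(columns={"play_count": "listen_count"})
                .drop(columns=["song_id", "track_id"]))
print(merged)
```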
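The simple brute-force approach, chart-based recommendation: rank the songs by play count and recommend the top of the chart.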
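A chart-based recommender of this kind can be sketched in a few lines. The data and the function name `popularity_recommend` are hypothetical; ties are broken alphabetically by title only to make the ordering deterministic.

```python
import pandas as pd

# Made-up listening data.
data = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u1", "u2"],
    "title": ["A", "A", "B", "C", "C"],
    "listen_count": [3, 1, 2, 5, 4],
})


def popularity_recommend(df, top_n=2):
    """Return the top_n titles ranked by total listen count."""
    totals = df.groupby("title", as_index=False)["listen_count"].sum()
    ranked = totals.sort_values(by=["listen_count", "title"],
                                ascending=[False, True]).reset_index(drop=True)
    ranked["rank"] = ranked.index + 1
    return ranked.head(top_n)


chart = popularity_recommend(data, top_n=2)
print(chart)
```

Every user receives the same list, which is why this approach is usually treated as a baseline.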
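Recommendation based on song similarity:
# 1. Get the user's playlist
# 2. Get all songs
# 3. Build the similarity matrix:
Suppose the user's playlist has 100 songs and the song library has 10,000; build a 100 x 10,000 matrix.
The value at row i, column j is the similarity between the i-th song of the user's playlist and the j-th song of the library (based on their listeners).
Similarity: take the users who listened to song i and the users who listened to song j, and divide the size of the intersection by the size of the union (the Jaccard index). The larger the value, the more similar the two songs.
# 4. Compute every entry of the matrix, then average each column. The column mean is the similarity between the user's playlist and that library song; the songs with the highest means are recommended.
Matrix construction: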
# Construct the co-occurrence matrix (a method of the recommender class, which
# provides get_item_users, train_data, item_id and user_id; requires numpy as np)
def construct_cooccurence_matrix(self, user_songs, all_songs):
    # Get the listeners (users) of every song in user_songs
    user_songs_users = []
    for i in range(0, len(user_songs)):
        user_songs_users.append(self.get_item_users(user_songs[i]))

    # Initialize the item co-occurrence matrix of size
    # len(user_songs) x len(all_songs)
    cooccurence_matrix = np.matrix(np.zeros(shape=(len(user_songs), len(all_songs))), float)

    # Calculate the similarity between the user's songs and all unique songs
    # in the training data
    for i in range(0, len(all_songs)):
        # Unique listeners (users) of library song i
        songs_i_data = self.train_data[self.train_data[self.item_id] == all_songs[i]]
        users_i = set(songs_i_data[self.user_id].unique())

        for j in range(0, len(user_songs)):
            # Unique listeners (users) of the user's song j
            users_j = user_songs_users[j]

            # Intersection of the listeners of songs i and j
            users_intersection = users_i.intersection(users_j)

            # cooccurence_matrix[j, i] is the Jaccard index of the two listener sets
            if len(users_intersection) != 0:
                users_union = users_i.union(users_j)
                cooccurence_matrix[j, i] = float(len(users_intersection)) / float(len(users_union))
            else:
                cooccurence_matrix[j, i] = 0

    return cooccurence_matrix
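The scoring in step 4 (average each column, recommend the highest) can be illustrated on a toy example. The listener sets below are made up, and skipping songs the user already has is an assumption of this sketch rather than something stated in the steps.

```python
import numpy as np

# Hypothetical listeners per song.
listeners = {
    "s1": {"u1", "u2", "u3"},
    "s2": {"u1", "u2"},
    "s3": {"u3", "u4"},
    "s4": {"u5"},
}
user_songs = ["s1", "s2"]              # the user's playlist
all_songs = ["s1", "s2", "s3", "s4"]   # the song library


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Rows: songs in the user's playlist; columns: songs in the library.
matrix = np.array([[jaccard(listeners[u], listeners[s]) for s in all_songs]
                   for u in user_songs])

# Step 4: the mean of each column scores a library song against the playlist.
scores = matrix.mean(axis=0)

# Assumption of this sketch: songs already in the playlist are skipped.
candidates = [(s, score) for s, score in zip(all_songs, scores)
              if s not in user_songs]
candidates.sort(key=lambda pair: pair[1], reverse=True)
print(candidates)
```

Here s3 shares one listener with s1 out of four distinct listeners, so its column is (0.25, 0) and its score is 0.125, which ranks it above s4.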
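First compute, for each song, (plays by the current user) / (the user's total plays) and use it as the score (analogous to a rating given to a movie).
* The larger a song's share of the user's plays, the more the user likes that song.
# Total listen count per user
triplet_dataset_sub_song_merged_sum_df = (
    triplet_dataset_sub_song_merged[['user', 'listen_count']].groupby('user').sum().reset_index())
triplet_dataset_sub_song_merged_sum_df.rename(columns={'listen_count': 'total_listen_count'}, inplace=True)
# Attach each user's total to every row (merges on the shared 'user' column)
triplet_dataset_sub_song_merged = pd.merge(triplet_dataset_sub_song_merged, triplet_dataset_sub_song_merged_sum_df)
triplet_dataset_sub_song_merged.head()
# Score = this song's share of the user's total plays
triplet_dataset_sub_song_merged['fractional_play_count'] = triplet_dataset_sub_song_merged['listen_count'] / triplet_dataset_sub_song_merged['total_listen_count']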
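The same score can be computed without the intermediate merge by using `groupby(...).transform`. A minimal sketch on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "song": ["s1", "s2", "s1"],
    "listen_count": [3, 1, 4],
})

# transform("sum") returns the per-user total aligned with every original row.
df["total_listen_count"] = df.groupby("user")["listen_count"].transform("sum")
df["fractional_play_count"] = df["listen_count"] / df["total_listen_count"]
print(df)
```

The fractions of each user sum to 1, so a user with a single song gives that song a score of 1.0.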
Map the user and song identifiers to short integer codes and build a sparse matrix.
from scipy.sparse import coo_matrix

# Map the long user and song identifiers to short integer codes
small_set = triplet_dataset_sub_song_merged
user_codes = small_set.user.drop_duplicates().reset_index()
song_codes = small_set.song.drop_duplicates().reset_index()
user_codes.rename(columns={'index': 'user_index'}, inplace=True)
song_codes.rename(columns={'index': 'song_index'}, inplace=True)
# Use the position in the de-duplicated list as the new index value
song_codes['so_index_value'] = list(song_codes.index)
user_codes['us_index_value'] = list(user_codes.index)
small_set = pd.merge(small_set, song_codes, how='left')
small_set = pd.merge(small_set, user_codes, how='left')
# Rows are users, columns are songs, values are the fractional play counts
mat_candidate = small_set[['us_index_value', 'so_index_value', 'fractional_play_count']]
data_array = mat_candidate.fractional_play_count.values
row_array = mat_candidate.us_index_value.values
col_array = mat_candidate.so_index_value.values
data_sparse = coo_matrix((data_array, (row_array, col_array)), dtype=float)
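As an alternative sketch, `pd.factorize` produces the integer codes directly, together with the labels needed to map an index back to the original identifier. The data is made up, and passing `shape` explicitly is a choice of this sketch.

```python
import pandas as pd
from scipy.sparse import coo_matrix

df = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "song": ["s1", "s2", "s1"],
    "fractional_play_count": [0.75, 0.25, 1.0],
})

# factorize assigns 0, 1, 2, ... in order of first appearance.
user_idx, user_labels = pd.factorize(df["user"])
song_idx, song_labels = pd.factorize(df["song"])

mat = coo_matrix(
    (df["fractional_play_count"].values, (user_idx, song_idx)),
    shape=(len(user_labels), len(song_labels)),
    dtype=float,
)
dense = mat.toarray()
print(dense)
```

`user_labels[i]` and `song_labels[j]` recover the original identifiers for row i and column j.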
Apply SVD to the sparse user-score matrix.
import math as mt
import numpy as np
from scipy.sparse.linalg import svds
from scipy.sparse import csc_matrix

def compute_svd(urm, K):
    U, s, Vt = svds(urm, K)
    # s holds the K singular values; place them on the diagonal of a K x K matrix.
    # Note: the square root of each singular value is stored, so U*S*Vt is a
    # rescaled variant of the rank-K approximation U*diag(s)*Vt.
    dim = (len(s), len(s))
    S = np.zeros(dim, dtype=np.float32)
    for i in range(0, len(s)):
        S[i, i] = mt.sqrt(s[i])
    # Convert the factors to sparse matrices
    U = csc_matrix(U, dtype=np.float32)
    S = csc_matrix(S, dtype=np.float32)
    Vt = csc_matrix(Vt, dtype=np.float32)
    return U, S, Vt

# Number of singular values to keep
K = 50
# Sparse user-score matrix
urm = data_sparse
# Number of songs
MAX_PID = urm.shape[1]
# Number of users
MAX_UID = urm.shape[0]
U, S, Vt = compute_svd(urm, K)
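For comparison with the function above, which stores the square roots of the singular values, the sketch below shows the plain truncated-SVD reconstruction U * diag(s) * Vt on a small made-up matrix. The matrix is built to have rank 2, so keeping k = 2 singular values should reproduce it up to numerical error.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

# Made-up 4 x 5 matrix of rank 2 (the sum of two outer products).
A_dense = (np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 1.0, 0.0, 1.0])
           + np.outer([1.0, -1.0, 1.0, -1.0], [0.0, 1.0, 0.0, 1.0, 0.0]))
A = csc_matrix(A_dense)

# svds requires k to be smaller than both dimensions of the matrix.
U, s, Vt = svds(A, k=2)

# Rank-k approximation from the unmodified singular values.
approx = U @ np.diag(s) @ Vt
max_error = np.abs(approx - A_dense).max()
print(U.shape, s.shape, Vt.shape, max_error)
```

On real data the matrix is not low rank, so the reconstruction is only an approximation, and it is that approximation which fills in scores for songs a user has not played.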
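Core step: take the row of U that belongs to each user to recommend for and multiply it by S x Vt, which gives the reconstructed scores of that user. The reconstruction assigns a new value to the songs the user has not scored (the empty positions of the sparse matrix). Sort these values and take the 10 largest; those are the songs to recommend to the user.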
def compute_estimated_matrix(urm, U, S, Vt, uTest, K, test):
    rightTerm = S * Vt
    max_recommendation = 10
    estimatedRatings = np.zeros(shape=(MAX_UID, MAX_PID), dtype=np.float16)
    # Song indices are integers; float16 cannot represent every integer above 2048,
    # so the recommended indices are stored in an integer array
    recomendRatings = np.zeros(shape=(MAX_UID, max_recommendation), dtype=np.int32)
    for userTest in uTest:
        # Estimated scores of this user for every song
        prod = U[userTest, :] * rightTerm
        estimatedRatings[userTest, :] = prod.todense()
        # Indices of the songs with the highest estimated scores
        recomendRatings[userTest, :] = (-estimatedRatings[userTest, :]).argsort()[:max_recommendation]
    return recomendRatings

# Users to generate recommendations for
uTest = [4, 5, 6, 7, 8, 873, 23]
uTest_recommended_items = compute_estimated_matrix(urm, U, S, Vt, uTest, K, True)
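The ranking step can be sketched on a tiny made-up score matrix. Two points in this sketch are assumptions rather than part of the code above: songs the user has already played are masked out before ranking, and the helper name `top_n_unseen` is hypothetical. The last lines illustrate why indices belong in an integer array: a float16 value cannot hold every integer above 2048.

```python
import numpy as np

# Made-up estimated scores: 2 users x 4 songs.
estimated = np.array([[0.9, 0.1, 0.4, 0.3],
                      [0.5, 0.8, 0.2, 0.6]])
# True where the user has already played the song.
observed = np.array([[True, False, False, False],
                     [False, True, False, False]])


def top_n_unseen(estimated, observed, user, n):
    """Indices of the n highest-scoring songs the user has not played."""
    scores = estimated[user].astype(float).copy()
    scores[observed[user]] = -np.inf   # never recommend an already played song
    return np.argsort(-scores)[:n]     # integer indices, best first


rec_user0 = top_n_unseen(estimated, observed, user=0, n=2)
rec_user1 = top_n_unseen(estimated, observed, user=1, n=2)
print(rec_user0, rec_user1)

# float16 has an 11-bit significand, so large indices get rounded.
rounded_index = int(np.float16(20001))
print(rounded_index)
```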
Recommendation results:
for user in uTest:
    print("Recommendation for user with user id {}".format(user))
    rank_value = 1
    for i in uTest_recommended_items[user, 0:10]:
        # Look up the title and artist for the recommended song index
        song_details = small_set[small_set.so_index_value == i].drop_duplicates('so_index_value')[['title', 'artist_name']]
        print("The number {} recommended song is {} BY {}".format(rank_value, list(song_details['title'])[0], list(song_details['artist_name'])[0]))
        rank_value += 1
Recommendation for user with user id 4
The number 1 recommended song is Fireflies BY Charttraxx Karaoke
The number 2 recommended song is Hey_ Soul Sister BY Train
The number 3 recommended song is OMG BY Usher featuring will.i.am
The number 4 recommended song is Lucky (Album Version) BY Jason Mraz & Colbie Caillat
The number 5 recommended song is Vanilla Twilight BY Owl City
The number 6 recommended song is Crumpshit BY Philippe Rochard
The number 7 recommended song is Billionaire [feat. Bruno Mars] (Explicit Album Version) BY Travie McCoy
The number 8 recommended song is Love Story BY Taylor Swift
The number 9 recommended song is TULENLIEKKI BY M.A. Numminen
The number 10 recommended song is Use Somebody BY Kings Of Leon
Recommendation for user with user id 5
The number 1 recommended song is Sehr kosmisch BY Harmonia
The number 2 recommended song is Ain’t Misbehavin BY Sam Cooke
The number 3 recommended song is Dog Days Are Over (Radio Edit) BY Florence + The Machine
The number 4 recommended song is Revelry BY Kings Of Leon
The number 5 recommended song is Undo BY Björk
The number 6 recommended song is Cosmic Love BY Florence + The Machine
The number 7 recommended song is Home BY Edward Sharpe & The Magnetic Zeros
The number 8 recommended song is You’ve Got The Love BY Florence + The Machine
The number 9 recommended song is Bring Me To Life BY Evanescence
The number 10 recommended song is Tighten Up BY The Black Keys
Recommendation for user with user id 6
The number 1 recommended song is Crumpshit BY Philippe Rochard
The number 2 recommended song is Marry Me BY Train
The number 3 recommended song is Hey_ Soul Sister BY Train
The number 4 recommended song is Lucky (Album Version) BY Jason Mraz & Colbie Caillat
The number 5 recommended song is One On One BY the bird and the bee
The number 6 recommended song is I Never Told You BY Colbie Caillat
The number 7 recommended song is Canada BY Five Iron Frenzy
The number 8 recommended song is Fireflies BY Charttraxx Karaoke
The number 9 recommended song is TULENLIEKKI BY M.A. Numminen
The number 10 recommended song is Bring Me To Life BY Evanescence
Recommendation for user with user id 7
The number 1 recommended song is Behind The Sea [Live In Chicago] BY Panic At The Disco
The number 2 recommended song is The City Is At War (Album Version) BY Cobra Starship
The number 3 recommended song is Dead Souls BY Nine Inch Nails
The number 4 recommended song is Una Confusion BY LU
The number 5 recommended song is Home BY Edward Sharpe & The Magnetic Zeros
The number 6 recommended song is Climbing Up The Walls BY Radiohead
The number 7 recommended song is Tighten Up BY The Black Keys
The number 8 recommended song is Tive Sim BY Cartola
The number 9 recommended song is West One (Shine On Me) BY The Ruts
The number 10 recommended song is Cosmic Love BY Florence + The Machine
Recommendation for user with user id 8
The number 1 recommended song is Undo BY Björk
The number 2 recommended song is Canada BY Five Iron Frenzy
The number 3 recommended song is Better To Reign In Hell BY Cradle Of Filth
The number 4 recommended song is Unite (2009 Digital Remaster) BY Beastie Boys
The number 5 recommended song is Behind The Sea [Live In Chicago] BY Panic At The Disco
The number 6 recommended song is Rockin’ Around The Christmas Tree BY Brenda Lee
The number 7 recommended song is Devil’s Slide BY Joe Satriani
The number 8 recommended song is Revelry BY Kings Of Leon
The number 9 recommended song is 16 Candles BY The Crests
The number 10 recommended song is Catch You Baby (Steve Pitron & Max Sanna Radio Edit) BY Lonnie Gordon
Recommendation for user with user id 873
The number 1 recommended song is The Scientist BY Coldplay
The number 2 recommended song is Yellow BY Coldplay
The number 3 recommended song is Clocks BY Coldplay
The number 4 recommended song is Fix You BY Coldplay
The number 5 recommended song is In My Place BY Coldplay
The number 6 recommended song is Shiver BY Coldplay
The number 7 recommended song is Speed Of Sound BY Coldplay
The number 8 recommended song is Creep (Explicit) BY Radiohead
The number 9 recommended song is Sparks BY Coldplay
The number 10 recommended song is Use Somebody BY Kings Of Leon
Recommendation for user with user id 23
The number 1 recommended song is Garden Of Eden BY Guns N’ Roses
The number 2 recommended song is Don’t Speak BY John Dahlbäck
The number 3 recommended song is Master Of Puppets BY Metallica
The number 4 recommended song is TULENLIEKKI BY M.A. Numminen
The number 5 recommended song is Bring Me To Life BY Evanescence
The number 6 recommended song is Kryptonite BY 3 Doors Down
The number 7 recommended song is Make Her Say BY Kid Cudi / Kanye West / Common
The number 8 recommended song is Night Village BY Deep Forest
The number 9 recommended song is Better To Reign In Hell BY Cradle Of Filth
The number 10 recommended song is Xanadu BY Olivia Newton-John;Electric Light Orchestra