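MovieLens is a collection of movie ratings provided by MovieLens users from the late 1990s to the early 2000s. The data include movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender, and occupation). Recommender systems based on machine learning algorithms are generally interested in this kind of data. This post shows how to slice and dice the data to fit practical needs.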
import pandas as pd
# The leading part of the path is elided; point it at your local copy of the data
path = 'C:\\...\\pydata-book-1st-edition\\ch02\\movielens'
# Convert Windows backslashes to forward slashes
path = path.replace('\\', '/')
# '::' is a multi-character separator, so the Python parsing engine is requested explicitly
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table(path + '/users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(path + '/ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(path + '/movies.dat', sep='::', header=None, names=mnames, engine='python')
The code above reads the three files into three pandas DataFrames: users, ratings, and movies.
The data are also available from the book's GitHub page: https://github.com/wesm/pydata-book/tree/1st-edition/ch02/movielens
The data look like this:
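To try the loading step without downloading the files, here is a minimal sketch that parses a small in-memory sample. The two sample rows and the name sample_users are made up for illustration; only the '::' layout and the column names come from the data set described above. Passing dtype for the zip column is an optional choice on my part, to keep leading zeros.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same 'field::field' layout as users.dat
sample = "1::F::1::10::48067\n2::M::56::16::70072\n"
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A separator longer than one character is treated as a regular expression,
# which only the Python parsing engine supports, so it is requested explicitly.
# Reading zip as str keeps any leading zeros intact.
sample_users = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                           names=unames, engine='python',
                           dtype={'zip': str})
print(sample_users)
```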
movies[:5]
Out[60]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
ratings[:5]
Out[61]:
user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
users[:5]
Out[62]:
user_id gender age occupation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455
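As you can imagine, working with three files at once takes more effort than working with one merged table. Conveniently, the three tables share columns (users and ratings both have user_id; movies and ratings both have movie_id), so they can be combined with pandas' built-in merge function, which joins on the overlapping column names by default.
If two tables have no column in common, merge raises MergeError: No common columns to perform merge on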
data = pd.merge(pd.merge(ratings, users), movies)
This produces the following data:
data[:5]
Out[65]:
user_id movie_id rating timestamp gender age occupation zip \
0 1 1193 5 978300760 F 1 10 48067
1 2 1193 5 978298413 M 56 16 70072
2 12 1193 4 978220179 M 25 12 32793
3 15 1193 4 978199279 M 25 7 22903
4 17 1193 5 978158471 M 50 1 95350
title genres
0 One Flew Over the Cuckoo's Nest (1975) Drama
1 One Flew Over the Cuckoo's Nest (1975) Drama
2 One Flew Over the Cuckoo's Nest (1975) Drama
3 One Flew Over the Cuckoo's Nest (1975) Drama
4 One Flew Over the Cuckoo's Nest (1975) Drama
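The nested merge relies on pandas inferring the join keys from overlapping column names. The sketch below spells the keys out with the on argument and also triggers the MergeError mentioned earlier. The three small frames (ratings_demo, users_demo, movies_demo) are invented for illustration and are not part of the MovieLens data.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
ratings_demo = pd.DataFrame({'user_id': [1, 1, 2],
                             'movie_id': [10, 20, 10],
                             'rating': [5, 3, 4]})
users_demo = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_demo = pd.DataFrame({'movie_id': [10, 20],
                            'title': ['Movie A', 'Movie B']})

# Naming the key makes each join explicit instead of inferred
merged = (ratings_demo
          .merge(users_demo, on='user_id')
          .merge(movies_demo, on='movie_id'))
print(merged)

# users_demo and movies_demo share no column, so merge cannot infer a key
try:
    pd.merge(users_demo, movies_demo)
    no_common = None
except pd.errors.MergeError as exc:
    no_common = str(exc)
print(no_common)
```

Being explicit about on also guards against an accidental join on some unrelated column that happens to share a name.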
To compute the mean rating of each movie by gender, use the pivot_table method:
mean_ratings = data.pivot_table(values='rating', index=['title'],
                                columns=['gender'], aggfunc='mean')
See the appendix for an example of the closely related pivot method.
The result:
mean_ratings[:5]
Out[70]:
gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
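To make the reshaping easier to follow, here is a small sketch of the same pivot_table call on a five-row frame. The frame demo and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical ratings: one row per (movie, viewer)
demo = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# Rows become titles, columns become genders, and each cell holds the
# mean of all ratings that fall into that (title, gender) group.
demo_means = demo.pivot_table(values='rating', index='title',
                              columns='gender', aggfunc='mean')
print(demo_means)
```

The two female ratings for Movie A (4 and 2) collapse into a single cell holding their mean, 3.0.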
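To restrict the analysis to movies with at least 250 ratings, first count how many ratings each movie received, which is easy to do with the size method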
ratings_by_title = data.groupby('title').size()
Then index the result with a boolean condition to select the titles of those movies
active_titles = ratings_by_title.index[ratings_by_title >= 250]
Next, select these movies from mean_ratings (note that active_titles above is a new pandas Index, so it can be passed directly to loc)
mean_ratings = mean_ratings.loc[active_titles]
The selected titles look like this:
active_titles[:5]
Out[77]:
Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
'101 Dalmatians (1961)', '101 Dalmatians (1996)',
'12 Angry Men (1957)'],
dtype='object', name='title')
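The count, filter, and select steps can be sketched end to end on a toy frame. The data and the threshold of 2 are made up for illustration (the text above uses 250), and mean_by_title stands in for the real mean_ratings table.

```python
import pandas as pd

# Hypothetical ratings for three movies
demo = pd.DataFrame({'title': ['A', 'A', 'A', 'B', 'C', 'C'],
                     'rating': [4, 5, 3, 2, 5, 4]})

# Step 1: number of ratings per title
counts = demo.groupby('title').size()

# Step 2: a boolean mask on the counts picks the index labels to keep
popular = counts.index[counts >= 2]

# Step 3: use those labels to select rows from another per-title table
mean_by_title = demo.groupby('title')['rating'].mean()
popular_means = mean_by_title.loc[popular]
print(popular_means)
```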
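To see the top movies among female viewers, sort by the F column in descending order (sort_index(by=...) is no longer supported in current pandas; sort_values replaces it)
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
The result: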
top_female_ratings[:5]
Out[78]:
gender F M
title
Close Shave, A (1995) 4.644444 4.473795
Wrong Trousers, The (1993) 4.588235 4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589
Wallace & Gromit: The Best of Aardman Animation... 4.563107 4.385075
Schindler's List (1993)
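To find the movies on which male and female viewers disagree most, add a column holding the difference of the means and sort by it
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_values(by='diff')
Sorting in ascending order puts the movies that women rated higher than men first: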
sorted_by_diff[:5]
Out[79]:
gender F M diff
title
Dirty Dancing (1987) 3.790378 2.959596 -0.830782
Jumpin' Jack Flash (1986) 3.254717 2.578358 -0.676359
Grease (1978) 3.975265 3.367041 -0.608224
Little Women (1994) 3.870588 3.321739 -0.548849
Steel Magnolias (1989) 3.901734 3.365957 -0.535777
Reversing the row order and taking the first 15 rows gives the movies that men rated higher than women:
sorted_by_diff[::-1][:15]
Out[80]:
gender F M diff
title
Good, The Bad and The Ugly, The (1966) 3.494949 4.221300 0.726351
Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359
Dumb & Dumber (1994) 2.697987 3.336595 0.638608
Longest Day, The (1962) 3.411765 4.031447 0.619682
Cable Guy, The (1996) 2.250000 2.863787 0.613787
Evil Dead II (Dead By Dawn) (1987) 3.297297 3.909283 0.611985
Hidden, The (1987) 3.137931 3.745098 0.607167
Rocky III (1982) 2.361702 2.943503 0.581801
Caddyshack (1980) 3.396135 3.969737 0.573602
For a Few Dollars More (1965) 3.409091 3.953795 0.544704
Porky's (1981) 2.296875 2.836364 0.539489
Animal House (1978) 3.628906 4.167192 0.538286
Exorcist, The (1973) 3.537634 4.067239 0.529605
Fright Night (1985) 2.973684 3.500000 0.526316
Barb Wire (1996) 1.585366 2.100386 0.515020
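As a small sketch of the difference-and-sort step, the example below uses three invented movies. It also shows sort_values(ascending=False) as an alternative to reversing with [::-1]; the two give the same order except possibly for how ties are arranged.

```python
import pandas as pd

# Hypothetical mean ratings by gender
means = pd.DataFrame({'F': [3.8, 2.7, 3.5], 'M': [3.0, 3.3, 3.6]},
                     index=pd.Index(['Movie A', 'Movie B', 'Movie C'],
                                    name='title'))

# Positive diff: men rated it higher; negative diff: women rated it higher
means['diff'] = means['M'] - means['F']

preferred_by_women = means.sort_values(by='diff')
preferred_by_men = means.sort_values(by='diff', ascending=False)
print(preferred_by_women)
print(preferred_by_men)
```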
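To find the movies that divide viewers most regardless of gender, measure disagreement by the standard deviation of the ratings (the .ix indexer has been removed from pandas; .loc does the label-based selection here)
rating_std_by_title = data.groupby('title')['rating'].std()
rating_std_by_title = rating_std_by_title.loc[active_titles]
The result: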
rating_std_by_title[:5]
Out[86]:
title
'burbs, The (1989) 1.107760
10 Things I Hate About You (1999) 0.989815
101 Dalmatians (1961) 0.982103
101 Dalmatians (1996) 1.098717
12 Angry Men (1957) 0.812731
Name: rating, dtype: float64
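A natural next step is to rank the titles by standard deviation. The sketch below does this with Series.sort_values, which replaced the older Series.order method. The toy ratings are invented for illustration.

```python
import pandas as pd

# Hypothetical ratings: A is polarizing, B is unanimous, C is in between
demo = pd.DataFrame({'title': ['A', 'A', 'B', 'B', 'C', 'C'],
                     'rating': [1, 5, 3, 3, 2, 4]})

# Sample standard deviation (ddof=1) of the ratings for each title
std_by_title = demo.groupby('title')['rating'].std()

# Largest spread first
most_divisive = std_by_title.sort_values(ascending=False)
print(most_divisive)
```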
Appendix:
df = pd.DataFrame({'foo': ['one','one','one','two','two','two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6]})
df
foo bar baz
0 one A 1
1 one B 2
2 one C 3
3 two A 4
4 two B 5
5 two C 6
df.pivot(index='foo', columns='bar', values='baz')
A B C
one 1 2 3
two 4 5 6
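The appendix example uses pivot, while the main text uses pivot_table. The sketch below illustrates one practical difference under an invented frame (dup): pivot only reshapes and fails when an index/column pair occurs more than once, whereas pivot_table aggregates the duplicates.

```python
import pandas as pd

# Hypothetical data where the pair ('one', 'A') appears twice
dup = pd.DataFrame({'foo': ['one', 'one', 'two'],
                    'bar': ['A', 'A', 'B'],
                    'baz': [1, 3, 5]})

# pivot cannot decide which of the two values belongs in the cell
try:
    dup.pivot(index='foo', columns='bar', values='baz')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table combines the duplicates with the aggregation function
table = dup.pivot_table(index='foo', columns='bar', values='baz',
                        aggfunc='mean')
print(pivot_failed)
print(table)
```

This is why the ratings data, where many users rate the same movie, needs pivot_table rather than pivot.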
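Note: sorting the final standard deviations with the order method did not work. Series.order has been removed from pandas; sort_values is its replacement.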