ch02: The MovieLens 1M Dataset

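MovieLens is a collection of movie ratings contributed by MovieLens users from the late 1990s to the early 2000s. The data include movie ratings, movie metadata (genre and year), and demographic data about the users (age, zip code, gender, and occupation). Recommender systems built on machine learning algorithms are typically interested in this kind of data; this post shows how to slice and dice it to suit a particular need.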

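Contents

  1. Load the data into pandas and merge the data from the separate files on their shared columns
  2. Use pandas to compute the mean user rating per movie, compare ratings by gender (movie preferences), and measure how much the ratings of a single movie disagree
  3. Appendix: a worked example of pivot, a pandas method commonly used to reshape data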

1. Loading the data

import pandas as pd

# Local path to the data (elided here); convert backslashes to forward slashes
path = 'C:\\...\\pydata-book-1st-edition\\ch02\\movielens'
path = path.replace('\\', '/')

# The files use '::' as the delimiter; a multi-character separator
# needs the Python parsing engine, so request it explicitly
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table(path + '/users.dat', sep='::', header=None,
                      names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(path + '/ratings.dat', sep='::', header=None,
                        names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(path + '/movies.dat', sep='::', header=None,
                       names=mnames, engine='python')

First, the data is read from the three files and stored in three pandas DataFrames: users, ratings, and movies.
The data can also be downloaded from the book's GitHub page: https://github.com/wesm/pydata-book/tree/1st-edition/ch02/movielens
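To try the parsing step without downloading anything, here is a minimal sketch that reads two '::'-delimited rows from an in-memory string. The two rows are copied from the users output in this post; the name sample_users and the dtype={'zip': str} argument are my own additions for illustration, the latter to show one way of keeping a leading zero in a zip code.

```python
import io
import pandas as pd

# Two rows in the same '::'-delimited layout as users.dat
sample = "1::F::1::10::48067\n4::M::45::7::02460\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# engine='python' is needed because '::' is a multi-character separator;
# dtype={'zip': str} (an assumption, not in the original code) keeps '02460' intact
sample_users = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                           names=unames, engine='python',
                           dtype={'zip': str})
print(sample_users)
```

The same keyword arguments can be passed to read_table when reading the real files.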

The data looks like this:

movies[:5]
Out[60]: 
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
ratings[:5]
Out[61]: 
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
users[:5]
Out[62]: 
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455

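As you might expect, working with three separate files is more cumbersome than working with a single merged table. Conveniently, the three tables share columns (users and ratings both have user_id; movies and ratings both have movie_id), so they can be combined with pandas' merge function, which joins on the overlapping column names by default.
If two tables had no column in common, merge would instead raise MergeError: No common columns to perform merge on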

data = pd.merge(pd.merge(ratings, users), movies)

This gives the following data:

data[:5]
Out[65]: 
   user_id  movie_id  rating  timestamp gender  age  occupation    zip  \
0        1      1193       5  978300760      F    1          10  48067   
1        2      1193       5  978298413      M   56          16  70072   
2       12      1193       4  978220179      M   25          12  32793   
3       15      1193       4  978199279      M   25           7  22903   
4       17      1193       5  978158471      M   50           1  95350   

                                    title genres  
0  One Flew Over the Cuckoo's Nest (1975)  Drama  
1  One Flew Over the Cuckoo's Nest (1975)  Drama  
2  One Flew Over the Cuckoo's Nest (1975)  Drama  
3  One Flew Over the Cuckoo's Nest (1975)  Drama  
4  One Flew Over the Cuckoo's Nest (1975)  Drama 
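The merge behaviour described above can be sketched on three tiny made-up tables. All of the data and the *_s names below are hypothetical; the on= arguments are spelled out only to make the join keys visible, since merge would infer them from the shared column names anyway.

```python
import pandas as pd

# Tiny stand-ins for ratings, users and movies (made-up values)
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [5, 3, 4]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20],
                         'title': ['Movie A', 'Movie B']})

# Join ratings to users on user_id, then to movies on movie_id
merged = pd.merge(pd.merge(ratings_s, users_s, on='user_id'),
                  movies_s, on='movie_id')
print(merged)

# users_s and movies_s share no column name, so an implicit merge fails
try:
    pd.merge(users_s, movies_s)
    merge_failed = False
except pd.errors.MergeError as exc:
    merge_failed = True
    print('MergeError:', exc)
```

Every rating row survives the merge here because each user_id and movie_id has a match in the other tables.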

2. Processing the data

1) Compute the mean rating of each movie separately for each gender

# aggfunc='mean' avoids depending on numpy, which was never imported above
mean_ratings = data.pivot_table(values='rating', index='title',
                                columns='gender', aggfunc='mean')

For details on this kind of reshaping, see item 1 of the appendix.

This gives:

mean_ratings[:5]
Out[70]: 
gender                                F         M
title                                            
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
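To make it clearer what pivot_table is doing here, the following sketch runs it on a small made-up table and compares it with an equivalent groupby followed by unstack. The frame toy and its values are hypothetical.

```python
import pandas as pd

# Made-up ratings: two movies, rated by women (F) and men (M)
toy = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, cells hold the mean rating
by_pivot = toy.pivot_table(values='rating', index='title',
                           columns='gender', aggfunc='mean')

# The same table built by hand: group, average, then move gender to columns
by_group = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(by_pivot)
print((by_pivot.values == by_group.values).all())
```

Both routes average the two F ratings of Movie A (4 and 2) into a single cell.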

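2) Drop movies with fewer than 250 ratings

First count how many ratings each movie received, which is easy to get with the size method

ratings_by_title = data.groupby('title').size()

Then a boolean mask on the index picks out the movies that have enough ratings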

active_titles = ratings_by_title.index[ratings_by_title >= 250]

Next, select those movies from mean_ratings (note that active_titles, built above, is a separate pandas Index)

mean_ratings = mean_ratings.loc[active_titles]

The first few entries of active_titles look like this:

active_titles[:5]
Out[77]: 
Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)',
       '12 Angry Men (1957)'],
      dtype='object', name='title')
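The count-then-filter pattern from this step can be sketched on made-up data. The frame toy, the threshold of 2, and the names counts, popular, and popular_means are all hypothetical stand-ins for the real data and the 250-rating cutoff.

```python
import pandas as pd

# Made-up ratings: A has 3 ratings, B has 1, C has 2
toy = pd.DataFrame({'title':  ['A', 'A', 'A', 'B', 'C', 'C'],
                    'rating': [5, 4, 3, 2, 4, 5]})

counts = toy.groupby('title').size()       # ratings per title
popular = counts.index[counts >= 2]        # boolean mask applied to the index

means = toy.groupby('title')['rating'].mean()
popular_means = means.loc[popular]         # keep only the popular titles
print(popular_means)
```

Title B is dropped because it has a single rating, which is below the threshold.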

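3) Find the movies women rated highest

# sort_index(by=...) is gone from current pandas; sort_values does the same job
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

This gives: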

top_female_ratings[:5]
Out[78]: 
gender                                                     F         M
title                                                                 
Close Shave, A (1995)                               4.644444  4.473795
Wrong Trousers, The (1993)                          4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation...  4.563107  4.385075
Schindler's List (1993)    

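4) Find the movies with the largest rating gap between men and women

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
# sort_values replaces the old sort_index(by=...) call
sorted_by_diff = mean_ratings.sort_values(by='diff')

This gives: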

sorted_by_diff[:5]
Out[79]: 
gender                            F         M      diff
title                                                  
Dirty Dancing (1987)       3.790378  2.959596 -0.830782
Jumpin' Jack Flash (1986)  3.254717  2.578358 -0.676359
Grease (1978)              3.975265  3.367041 -0.608224
Little Women (1994)        3.870588  3.321739 -0.548849
Steel Magnolias (1989)     3.901734  3.365957 -0.535777

To see the other end of the ranking (movies men rated higher), reverse the order:

sorted_by_diff[::-1][:15]
Out[80]: 
gender                                         F         M      diff
title                                                               
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
Porky's (1981)                          2.296875  2.836364  0.539489
Animal House (1978)                     3.628906  4.167192  0.538286
Exorcist, The (1973)                    3.537634  4.067239  0.529605
Fright Night (1985)                     2.973684  3.500000  0.526316
Barb Wire (1996)                        1.585366  2.100386  0.515020
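As a small illustration of the sorting used in steps 3 and 4, the sketch below builds a made-up table of mean ratings, adds the diff column, and sorts it. The values and the names mean_toy, by_diff, and by_gap are hypothetical; the last line, which ranks by the absolute gap regardless of direction, is an extra that is not part of the original steps.

```python
import pandas as pd

# Made-up mean ratings by gender
mean_toy = pd.DataFrame({'F': [4.0, 3.0, 2.5],
                         'M': [3.0, 3.5, 2.5]},
                        index=['Movie A', 'Movie B', 'Movie C'])

mean_toy['diff'] = mean_toy['M'] - mean_toy['F']

# Ascending: movies preferred by women come first (most negative diff)
by_diff = mean_toy.sort_values(by='diff')

# Reversed: movies preferred by men come first
reversed_diff = by_diff[::-1]

# Extra (not in the original): rank by the size of the gap in either direction
by_gap = mean_toy['diff'].abs().sort_values(ascending=False)

print(by_diff)
print(by_gap)
```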

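5) Find the movies raters disagree on most (measured by the standard deviation of the ratings)

rating_std_by_title = data.groupby('title')['rating'].std()
# .ix has been removed from pandas; .loc does label-based selection
rating_std_by_title = rating_std_by_title.loc[active_titles]

This gives: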

rating_std_by_title[:5]
Out[86]: 
title
'burbs, The (1989)                   1.107760
10 Things I Hate About You (1999)    0.989815
101 Dalmatians (1961)                0.982103
101 Dalmatians (1996)                1.098717
12 Angry Men (1957)                  0.812731
Name: rating, dtype: float64
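The output above is still in alphabetical order. A minimal sketch of how the standard deviations could be ranked from most to least divisive, using made-up data and sort_values (the names toy, std_by_title, and most_divisive are hypothetical):

```python
import pandas as pd

# Made-up ratings: A splits opinion (1 and 5), B is unanimous (3 and 3)
toy = pd.DataFrame({'title':  ['A', 'A', 'B', 'B'],
                    'rating': [1, 5, 3, 3]})

# Sample standard deviation of the ratings for each title
std_by_title = toy.groupby('title')['rating'].std()

# Largest spread first
most_divisive = std_by_title.sort_values(ascending=False)
print(most_divisive)
```

Note that std computes the sample standard deviation (dividing by n - 1), so a title with a single rating would come out as NaN.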

Appendix:

1. Using DataFrame.pivot

df = pd.DataFrame({'foo': ['one','one','one','two','two','two'],
                           'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                           'baz': [1, 2, 3, 4, 5, 6]})
df
        foo   bar  baz
    0   one   A    1
    1   one   B    2
    2   one   C    3
    3   two   A    4
    4   two   B    5
    5   two   C    6

df.pivot(index='foo', columns='bar', values='baz')
         A   B   C
    one  1   2   3
    two  4   5   6
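One difference between pivot and the pivot_table call used in section 2 is worth sketching: pivot only rearranges values, so it fails when an index/column pair occurs more than once, whereas pivot_table aggregates the duplicates. The frame dup below is made up for illustration.

```python
import pandas as pd

# ('one', 'A') appears twice, so there is no single value for that cell
dup = pd.DataFrame({'foo': ['one', 'one', 'two'],
                    'bar': ['A', 'A', 'B'],
                    'baz': [1, 3, 5]})

try:
    dup.pivot(index='foo', columns='bar', values='baz')
    pivot_ok = True
except ValueError as exc:
    pivot_ok = False
    print('pivot failed:', exc)

# pivot_table resolves the duplicates by averaging them
table = dup.pivot_table(index='foo', columns='bar', values='baz',
                        aggfunc='mean')
print(table)
```

This is why the ratings data, where each title has many ratings per gender, needs pivot_table rather than pivot.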

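Note: I ran into a problem using the order method for the final sort by standard deviation. Series.order has been removed from recent versions of pandas; sort_values(ascending=False) is its replacement.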

