利用python进入数据分析之MovieLens_1M数据分析

数据设置

In [26]:
import pandas as pd
import os
encoding = 'latin1'

upath = os.path.expanduser('ch02/movielens/users.dat')
rpath = os.path.expanduser('ch02/movielens/ratings.dat')
mpath = os.path.expanduser('ch02/movielens/movies.dat')

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
mnames = ['movie_id', 'title', 'genres']

users = pd.read_csv(upath, sep='::', header=None, names=unames, encoding=encoding)
ratings = pd.read_csv(rpath, sep='::', header=None, names=rnames, encoding=encoding)
movies = pd.read_csv(mpath, sep='::', header=None, names=mnames, encoding=encoding)
In [6]:
users[:5]
Out[6]:
  user_id gender age occupation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455
In [7]:
ratings[:5]
Out[7]:
  user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
In [8]:
movies[:5]
Out[8]:
  movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
In [9]:
ratings
Out[9]:
  user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
5 1 1197 3 978302268
6 1 1287 5 978302039
7 1 2804 5 978300719
8 1 594 4 978302268
9 1 919 4 978301368
10 1 595 5 978824268
11 1 938 4 978301752
12 1 2398 4 978302281
13 1 2918 4 978302124
14 1 1035 5 978301753
15 1 2791 4 978302188
16 1 2687 3 978824268
17 1 2018 4 978301777
18 1 3105 5 978301713
19 1 2797 4 978302039
20 1 2321 3 978302205
21 1 720 3 978300760
22 1 1270 5 978300055
23 1 527 5 978824195
24 1 2340 3 978300103
25 1 48 5 978824351
26 1 1097 4 978301953
27 1 1721 4 978300055
28 1 1545 4 978824139
29 1 745 3 978824268
... ... ... ... ...
1000179 6040 2762 4 956704584
1000180 6040 1036 3 956715455
1000181 6040 508 4 956704972
1000182 6040 1041 4 957717678
1000183 6040 3735 4 960971654
1000184 6040 2791 4 956715569
1000185 6040 2794 1 956716438
1000186 6040 527 5 956704219
1000187 6040 2003 1 956716294
1000188 6040 535 4 964828734
1000189 6040 2010 5 957716795
1000190 6040 2011 4 956716113
1000191 6040 3751 4 964828782
1000192 6040 2019 5 956703977
1000193 6040 541 4 956715288
1000194 6040 1077 5 964828799
1000195 6040 1079 2 956715648
1000196 6040 549 4 956704746
1000197 6040 2020 3 956715288
1000198 6040 2021 3 956716374
1000199 6040 2022 5 956716207
1000200 6040 2028 5 956704519
1000201 6040 1080 4 957717322
1000202 6040 1089 4 956704996
1000203 6040 1090 3 956715518
1000204 6040 1091 1 956716541
1000205 6040 1094 5 956704887
1000206 6040 562 5 956704746
1000207 6040 1096 4 956715648
1000208 6040 1097 4 956715569

1000209 rows × 4 columns

数据合并

In [10]:
data = pd.merge(pd.merge(ratings, users), movies)
data
Out[10]:
  user_id movie_id rating timestamp gender age occupation zip title genres
0 1 1193 5 978300760 F 1 10 48067 One Flew Over the Cuckoo's Nest (1975) Drama
1 2 1193 5 978298413 M 56 16 70072 One Flew Over the Cuckoo's Nest (1975) Drama
2 12 1193 4 978220179 M 25 12 32793 One Flew Over the Cuckoo's Nest (1975) Drama
3 15 1193 4 978199279 M 25 7 22903 One Flew Over the Cuckoo's Nest (1975) Drama
4 17 1193 5 978158471 M 50 1 95350 One Flew Over the Cuckoo's Nest (1975) Drama
5 18 1193 4 978156168 F 18 3 95825 One Flew Over the Cuckoo's Nest (1975) Drama
6 19 1193 5 982730936 M 1 10 48073 One Flew Over the Cuckoo's Nest (1975) Drama
7 24 1193 5 978136709 F 25 7 10023 One Flew Over the Cuckoo's Nest (1975) Drama
8 28 1193 3 978125194 F 25 1 14607 One Flew Over the Cuckoo's Nest (1975) Drama
9 33 1193 5 978557765 M 45 3 55421 One Flew Over the Cuckoo's Nest (1975) Drama
10 39 1193 5 978043535 M 18 4 61820 One Flew Over the Cuckoo's Nest (1975) Drama
11 42 1193 3 978038981 M 25 8 24502 One Flew Over the Cuckoo's Nest (1975) Drama
12 44 1193 4 978018995 M 45 17 98052 One Flew Over the Cuckoo's Nest (1975) Drama
13 47 1193 4 977978345 M 18 4 94305 One Flew Over the Cuckoo's Nest (1975) Drama
14 48 1193 4 977975061 M 25 4 92107 One Flew Over the Cuckoo's Nest (1975) Drama
15 49 1193 4 978813972 M 18 12 77084 One Flew Over the Cuckoo's Nest (1975) Drama
16 53 1193 5 977946400 M 25 0 96931 One Flew Over the Cuckoo's Nest (1975) Drama
17 54 1193 5 977944039 M 50 1 56723 One Flew Over the Cuckoo's Nest (1975) Drama
18 58 1193 5 977933866 M 25 2 30303 One Flew Over the Cuckoo's Nest (1975) Drama
19 59 1193 4 977934292 F 50 1 55413 One Flew Over the Cuckoo's Nest (1975) Drama
20 62 1193 4 977968584 F 35 3 98105 One Flew Over the Cuckoo's Nest (1975) Drama
21 80 1193 4 977786172 M 56 1 49327 One Flew Over the Cuckoo's Nest (1975) Drama
22 81 1193 5 977785864 F 25 0 60640 One Flew Over the Cuckoo's Nest (1975) Drama
23 88 1193 5 977694161 F 45 1 02476 One Flew Over the Cuckoo's Nest (1975) Drama
24 89 1193 5 977683596 F 56 9 85749 One Flew Over the Cuckoo's Nest (1975) Drama
25 95 1193 5 977626632 M 45 0 98201 One Flew Over the Cuckoo's Nest (1975) Drama
26 96 1193 3 977621789 F 25 16 78028 One Flew Over the Cuckoo's Nest (1975) Drama
27 99 1193 2 982791053 F 1 10 19390 One Flew Over the Cuckoo's Nest (1975) Drama
28 102 1193 5 1040737607 M 35 19 20871 One Flew Over the Cuckoo's Nest (1975) Drama
29 104 1193 2 977546620 M 25 12 00926 One Flew Over the Cuckoo's Nest (1975) Drama
... ... ... ... ... ... ... ... ... ... ...
1000179 4933 3084 3 962757020 M 25 15 94040 Home Page (1999) Documentary
1000180 4802 2218 2 1014866656 M 56 1 40601 Juno and Paycock (1930) Drama
1000181 4812 2308 2 962932391 M 18 14 25301 Detroit 9000 (1973) Action|Crime
1000182 4874 624 4 962781918 F 25 4 70808 Condition Red (1995) Action|Drama|Thriller
1000183 5059 1434 4 962484364 M 45 16 22652 Stranger, The (1994) Action
1000184 5947 1434 4 957190428 F 45 16 97215 Stranger, The (1994) Action
1000185 5077 1868 3 962417299 M 25 2 20037 Truce, The (1996) Drama|War
1000186 5944 1868 1 957197520 F 18 10 27606 Truce, The (1996) Drama|War
1000187 5105 404 3 962337582 M 50 7 18977 Brother Minister: The Assassination of Malcolm... Documentary
1000188 5185 404 4 963402617 F 35 4 44485 Brother Minister: The Assassination of Malcolm... Documentary
1000189 5532 404 5 959619841 M 25 17 27408 Brother Minister: The Assassination of Malcolm... Documentary
1000190 5543 404 3 960127592 M 25 17 97401 Brother Minister: The Assassination of Malcolm... Documentary
1000191 5220 2543 3 961546137 M 25 7 91436 Six Ways to Sunday (1997) Comedy
1000192 5754 2543 4 958272316 F 18 1 60640 Six Ways to Sunday (1997) Comedy
1000193 5227 591 3 961475931 M 18 10 64050 Tough and Deadly (1995) Action|Drama|Thriller
1000194 5795 591 1 958145253 M 25 1 92688 Tough and Deadly (1995) Action|Drama|Thriller
1000195 5313 3656 5 960920392 M 56 0 55406 Lured (1947) Crime
1000196 5328 2438 4 960838075 F 25 4 91740 Outside Ozona (1998) Drama|Thriller
1000197 5334 3323 3 960796159 F 56 13 46140 Chain of Fools (2000) Comedy|Crime
1000198 5334 127 1 960795494 F 56 13 46140 Silence of the Palace, The (Saimt el Qusur) (1... Drama
1000199 5334 3382 5 960796159 F 56 13 46140 Song of Freedom (1936) Drama
1000200 5420 1843 3 960156505 F 1 19 14850 Slappy and the Stinkers (1998) Children's|Comedy
1000201 5433 286 3 960240881 F 35 17 45014 Nemesis 2: Nebula (1995) Action|Sci-Fi|Thriller
1000202 5494 3530 4 959816296 F 35 17 94306 Smoking/No Smoking (1993) Comedy
1000203 5556 2198 3 959445515 M 45 6 92103 Modulations (1998) Documentary
1000204 5949 2198 5 958846401 M 18 17 47901 Modulations (1998) Documentary
1000205 5675 2703 3 976029116 M 35 14 30030 Broken Vessels (1998) Drama
1000206 5780 2845 1 958153068 M 18 17 92886 White Boys (1999) Drama
1000207 5851 3607 5 957756608 F 18 20 55410 One Little Indian (1973) Comedy|Drama|Western
1000208 5938 2909 4 957273353 M 25 1 35401 Five Wives, Three Secretaries and Me (1998) Documentary

1000209 rows × 10 columns

In [11]:
data.ix[0]
D:\python2713\lib\anaconda_install\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  """Entry point for launching an IPython kernel.
Out[11]:
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object

计算电影平均分

In [34]:
import sys
reload(sys)
sys.setdefaultencoding('latin1')
mean_ratings = data.pivot_table('rating', index='title',columns='gender', aggfunc='mean')
In [38]:
mean_ratings[:5]
Out[38]:
gender F M
title    
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
In [39]:
ratings_by_title = data.groupby('title').size()  #对title进行分组
In [40]:
ratings_by_title[:5]
Out[40]:
title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
dtype: int64
In [41]:
active_titles = ratings_by_title.index[ratings_by_title >= 250] # 获得评论数据大于250的电影
In [42]:
active_titles[:10]
Out[42]:
Index([u''burbs, The (1989)', u'10 Things I Hate About You (1999)',
       u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
       u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
       u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
       u'2001: A Space Odyssey (1968)', u'2010 (1984)'],
      dtype='object', name=u'title')
In [43]:
mean_ratings = mean_ratings.ix[active_titles]
mean_ratings
Out[43]:
gender F M
title    
'burbs, The (1989) 2.793478 2.962085
10 Things I Hate About You (1999) 3.646552 3.311966
101 Dalmatians (1961) 3.791444 3.500000
101 Dalmatians (1996) 3.240000 2.911215
12 Angry Men (1957) 4.184397 4.328421
13th Warrior, The (1999) 3.112000 3.168000
2 Days in the Valley (1996) 3.488889 3.244813
20,000 Leagues Under the Sea (1954) 3.670103 3.709205
2001: A Space Odyssey (1968) 3.825581 4.129738
2010 (1984) 3.446809 3.413712
28 Days (2000) 3.209424 2.977707
39 Steps, The (1935) 3.965517 4.107692
54 (1998) 2.701754 2.782178
7th Voyage of Sinbad, The (1958) 3.409091 3.658879
8MM (1999) 2.906250 2.850962
About Last Night... (1986) 3.188679 3.140909
Absent Minded Professor, The (1961) 3.469388 3.446809
Absolute Power (1997) 3.469136 3.327759
Abyss, The (1989) 3.659236 3.689507
Ace Ventura: Pet Detective (1994) 3.000000 3.197917
Ace Ventura: When Nature Calls (1995) 2.269663 2.543333
Addams Family Values (1993) 3.000000 2.878531
Addams Family, The (1991) 3.186170 3.163498
Adventures in Babysitting (1987) 3.455782 3.208122
Adventures of Buckaroo Bonzai Across the 8th Dimension, The (1984) 3.308511 3.402321
Adventures of Priscilla, Queen of the Desert, The (1994) 3.989071 3.688811
Adventures of Robin Hood, The (1938) 4.166667 3.918367
African Queen, The (1951) 4.324232 4.223822
Age of Innocence, The (1993) 3.827068 3.339506
Agnes of God (1985) 3.534884 3.244898
... ... ...
White Men Can't Jump (1992) 3.028777 3.231061
Who Framed Roger Rabbit? (1988) 3.569378 3.713251
Who's Afraid of Virginia Woolf? (1966) 4.029703 4.096939
Whole Nine Yards, The (2000) 3.296552 3.404814
Wild Bunch, The (1969) 3.636364 4.128099
Wild Things (1998) 3.392000 3.459082
Wild Wild West (1999) 2.275449 2.131973
William Shakespeare's Romeo and Juliet (1996) 3.532609 3.318644
Willow (1988) 3.658683 3.453543
Willy Wonka and the Chocolate Factory (1971) 4.063953 3.789474
Witness (1985) 4.115854 3.941504
Wizard of Oz, The (1939) 4.355030 4.203138
Wolf (1994) 3.074074 2.899083
Women on the Verge of a Nervous Breakdown (1988) 3.934307 3.865741
Wonder Boys (2000) 4.043796 3.913649
Working Girl (1988) 3.606742 3.312500
World Is Not Enough, The (1999) 3.337500 3.388889
Wrong Trousers, The (1993) 4.588235 4.478261
Wyatt Earp (1994) 3.147059 3.283898
X-Files: Fight the Future, The (1998) 3.489474 3.493797
X-Men (2000) 3.682310 3.851702
Year of Living Dangerously (1982) 3.951220 3.869403
Yellow Submarine (1968) 3.714286 3.689286
You've Got Mail (1998) 3.542424 3.275591
Young Frankenstein (1974) 4.289963 4.239177
Young Guns (1988) 3.371795 3.425620
Young Guns II (1990) 2.934783 2.904025
Young Sherlock Holmes (1985) 3.514706 3.363344
Zero Effect (1998) 3.864407 3.723140
eXistenZ (1999) 3.098592 3.289086

1216 rows × 2 columns

In [44]:
mean_ratings = mean_ratings.rename(index={'Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)':
                           'Seven Samurai (Shichinin no samurai) (1954)'})
In [45]:
top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)# 获取女性观众最喜欢的电影
top_female_ratings[:10]
Out[45]:
gender F M
title    
Close Shave, A (1995) 4.644444 4.473795
Wrong Trousers, The (1993) 4.588235 4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589
Wallace & Gromit: The Best of Aardman Animation (1996) 4.563107 4.385075
Schindler's List (1993) 4.562602 4.491415
Shawshank Redemption, The (1994) 4.539075 4.560625
Grand Day Out, A (1992) 4.537879 4.293255
To Kill a Mockingbird (1962) 4.536667 4.372611
Creature Comforts (1990) 4.513889 4.272277
Usual Suspects, The (1995) 4.513317 4.518248

你可能感兴趣的:(python,机器学习,利用python进行数据分析)