Python Data Analysis Examples (2): Day 3

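Note: this post is a study log on data processing with Python. Most of the material comes from the book Python for Data Analysis (《利用Python进行数据分析》) by Wes McKinney, published by China Machine Press (机械工业出版社).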

Movie Data Analysis

The required files were downloaded in Day 2. The formats of the files used below are as follows:

users.dat file format
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117

ratings.dat file format
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968

movies.dat file format
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance

Read each table into a pandas DataFrame object with pandas.read_table:

import pandas as pd
import os
path='E:\\Enthought\\book\\ch02\\movielens'
os.chdir(path) # change the current working directory to path

unames = ['user_id','gender','age','occupation','zip']
users = pd.read_table('users.dat',sep='::',header=None,names=unames) # split each record on '::'
-c:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.

rnames = ['user_id','movie_id','rating','timestamp']
ratings = pd.read_table('ratings.dat',sep='::',header=None,names=rnames,engine='python') # passing engine='python' avoids the warning above

mnames = ['movie_id','title','genres']
movies = pd.read_table('movies.dat',sep='::',header=None,names=mnames,engine='python')
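
The parsing step can be tried without the data files. The following is a minimal sketch, assuming the three users.dat records quoted above are pasted into an in-memory buffer. It uses pd.read_csv, which accepts the same sep, header, names and engine arguments; the names sample and users_sample are made up for this illustration.

```python
import io
import pandas as pd

# Three sample records copied from the users.dat excerpt above
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "3::M::25::15::55117\n"
)
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A separator longer than one character is interpreted as a regular
# expression, which only the Python parsing engine supports, so
# engine='python' is passed explicitly to avoid the ParserWarning.
users_sample = pd.read_csv(sample, sep='::', header=None,
                           names=unames, engine='python')
print(users_sample)
```

Passing engine='python' up front should give the same result as letting pandas fall back to it, just without the warning.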

Inspect each DataFrame object:

users[:5]
Out[11]: 
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455

ratings[:5]
Out[12]: 
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

movies[:5]
Out[13]: 
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy

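The age and occupation fields are given as codes; see the README for their meanings.
Next, let's analyze data that is spread across the three tables. Suppose we want to compute the mean rating of a movie by gender and age. This is much simpler once all the data is merged into a single table. We first use the pandas merge function to merge ratings with users, then merge movies into the result. pandas infers which columns to use as the merge (or join) keys from the overlapping column names: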

data = pd.merge(pd.merge(ratings,users),movies)
data[:5] # possibly because the merge behavior changed, the next two outputs differ from the book
Out[16]: 
   user_id  movie_id  rating  timestamp gender  age  occupation    zip  \
0        1      1193       5  978300760      F    1          10  48067   
1        2      1193       5  978298413      M   56          16  70072   
2       12      1193       4  978220179      M   25          12  32793   
3       15      1193       4  978199279      M   25           7  22903   
4       17      1193       5  978158471      M   50           1  95350   

                                    title genres  
0  One Flew Over the Cuckoo's Nest (1975) Drama 
1  One Flew Over the Cuckoo's Nest (1975) Drama 
2  One Flew Over the Cuckoo's Nest (1975) Drama 
3  One Flew Over the Cuckoo's Nest (1975) Drama 
4  One Flew Over the Cuckoo's Nest (1975) Drama 

data.iloc[0] # show the first record (the book uses data.ix[0]; .ix was removed in later pandas releases)
Out[17]: 
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
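
To see how the join keys are inferred, here is a small sketch on made-up tables (ratings_s, users_s and movies_s are hypothetical stand-ins, not the MovieLens data). It compares the nested pd.merge call, which relies on the overlapping column names, with a version that names the keys explicitly through on=.

```python
import pandas as pd

# Tiny made-up tables that mimic the column layout of the real ones
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [5, 3, 4]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20],
                         'title': ['A (1990)', 'B (1991)']})

# Keys inferred from overlapping column names:
# user_id for the first merge, movie_id for the second
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# The same joins with the keys spelled out
explicit = (ratings_s.merge(users_s, on='user_id')
                     .merge(movies_s, on='movie_id'))
print(explicit)
```

Spelling out on= is a bit more verbose, but it avoids surprises if two tables happen to share a column that is not meant to be a key.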

Now the rating data can be aggregated by any number of user or movie attributes. To compute each movie's mean rating by gender, use the pivot_table method:

mean_ratings = data.pivot_table('rating',index='title',columns='gender',aggfunc='mean') # the parameters were renamed (rows -> index, cols -> columns), so this differs from the book
mean_ratings[:5]
Out[26]: 
gender                                F         M
title                                            
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
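
As a rough illustration of what pivot_table computes, the sketch below uses a made-up five-row table (toy is hypothetical) and compares the result with a groupby followed by unstack, which should give the same numbers.

```python
import pandas as pd

# Made-up ratings: movie A has two female ratings and one male rating
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, each cell a mean rating
by_pivot = toy.pivot_table(values='rating', index='title',
                           columns='gender', aggfunc='mean')

# Same aggregation expressed as groupby + unstack
by_groupby = toy.groupby(['title', 'gender'])['rating'].mean().unstack()
print(by_pivot)
```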

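This produces a DataFrame of mean ratings, with movie titles as the row labels and gender as the column labels. Next, filter out movies with fewer than 250 ratings. Group by title first, then use size() to get a Series holding the size of each title group: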

ratings_by_title = data.groupby('title').size()
ratings_by_title[:10]
Out[28]: 
title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64

active_titles = ratings_by_title.index[ratings_by_title>=250]
active_titles
Out[31]: 
Index([u''burbs, The (1989)', u'10 Things I Hate About You (1999)',
       u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
       u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
       u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
       u'2001: A Space Odyssey (1968)', u'2010 (1984)',
       ...
       u'X-Men (2000)', u'Year of Living Dangerously (1982)',
       u'Yellow Submarine (1968)', u'You've Got Mail (1998)',
       u'Young Frankenstein (1974)', u'Young Guns (1988)',
       u'Young Guns II (1990)', u'Young Sherlock Holmes (1985)',
       u'Zero Effect (1998)', u'eXistenZ (1999)'],
      dtype='object', name=u'title', length=1216)

This index contains the titles of movies with at least 250 ratings. We can use it to select the corresponding rows from mean_ratings:

mean_ratings = mean_ratings.loc[active_titles] # the book uses .ix, which was removed in later pandas releases
mean_ratings[:5] # differs from the book here
Out[34]: 
gender                                    F         M
title                                                
'burbs, The (1989)                  2.793478  2.962085
10 Things I Hate About You (1999)  3.646552  3.311966
101 Dalmatians (1961)              3.791444  3.500000
101 Dalmatians (1996)              3.240000  2.911215
12 Angry Men (1957)                4.184397  4.328421
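
The count-then-filter pattern can be checked on a small made-up table. In this sketch, toy and MIN_RATINGS are hypothetical; the threshold of 2 stands in for the 250 used on the real data.

```python
import pandas as pd

# Made-up ratings: A has 3 ratings, B has 1, C has 2
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [4, 2, 5, 3, 1, 5],
})
MIN_RATINGS = 2  # hypothetical threshold standing in for 250

# Number of ratings per title
counts = toy.groupby('title').size()

# Boolean mask on the counts selects the titles that have enough ratings
active = counts.index[counts >= MIN_RATINGS]

# Use the resulting index to pick rows from another per-title result
means = toy.groupby('title')['rating'].mean()
filtered = means.loc[active]
print(filtered)
```

Title B has only one rating, so it drops out, while A and C are kept.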

To find the movies that female viewers like best, sort by the F column in descending order:

top_female_ratings = mean_ratings.sort_index(by='F',ascending=False)
-c:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
# A warning appears here: the sort_index parameters documented for pandas 0.18.1 do not include by; see the comparison below
top_female_ratings = mean_ratings.sort_values(by='F',ascending=False)

top_female_ratings[:10]
Out[38]: 
gender                                                     F         M
title                                                                 
Close Shave, A (1995)                               4.644444  4.473795
Wrong Trousers, The (1993)                          4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation...  4.563107  4.385075
Schindler's List (1993)                             4.562602  4.491415
Shawshank Redemption, The (1994)                    4.539075  4.560625
Grand Day Out, A (1992)                             4.537879  4.293255
To Kill a Mockingbird (1962)                        4.536667  4.372611
Creature Comforts (1990)                            4.513889  4.272277
Usual Suspects, The (1995)                          4.513317  4.518248

Comparison of the two functions named in the warning (pandas 0.18.1)

pandas.DataFrame.sort_index()
Parameters:
axis : index, columns to direct sorting
level : int or level name or list of ints or list of level names
if not None, sort on values in specified index level(s)
ascending : boolean, default True
Sort ascending vs. descending
inplace : bool, if True, perform operation in-place
kind : {quicksort, mergesort, heapsort}
Choice of sorting algorithm. See also ndarray.np.sort for more information. mergesort is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.
na_position : {‘first’, ‘last’}
first puts NaNs at the beginning, last puts NaNs at the end
sort_remaining : bool
if true and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level
Returns:
sorted_obj : DataFrame

pandas.DataFrame.sort_values()
Parameters:
by : string name or list of names which refer to the axis items
axis : index, columns to direct sorting
ascending : bool or list of bool
Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
inplace : bool
if True, perform operation in-place
kind : {quicksort, mergesort, heapsort}
Choice of sorting algorithm. See also ndarray.np.sort for more information. mergesort is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.
na_position : {‘first’, ‘last’}
first puts NaNs at the beginning, last puts NaNs at the end
Returns:
sorted_obj : DataFrame
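
The practical difference between the two methods can be sketched on a made-up frame (scores is hypothetical): sort_values orders rows by the values in a column, while sort_index orders them by the row labels.

```python
import pandas as pd

# Made-up mean ratings with deliberately unsorted row labels
scores = pd.DataFrame({'F': [3.0, 4.5, 4.0], 'M': [3.5, 4.0, 4.2]},
                      index=['b', 'c', 'a'])

# Order rows by the values in column F, highest first
by_column = scores.sort_values(by='F', ascending=False)

# Order rows by their index labels
by_label = scores.sort_index()

print(by_column)
print(by_label)
```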

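Measuring rating disagreement
Suppose we want to find the movies on which male and female viewers disagree most. One approach is to add a column diff to mean_ratings that holds the difference between the mean ratings, then sort by it. Because diff is M minus F, sorting in ascending order puts first the movies with the largest gap that female viewers prefer: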

mean_ratings['diff'] = mean_ratings['M']-mean_ratings['F']
sort_by_diff = mean_ratings.sort_values(by='diff')
sort_by_diff[:5]
Out[41]: 
gender                            F         M      diff
title                                                  
Dirty Dancing (1987)       3.790378  2.959596 -0.830782
Jumpin' Jack Flash (1986)  3.254717  2.578358 -0.676359
Grease (1978)              3.975265  3.367041 -0.608224
Little Women (1994)        3.870588  3.321739 -0.548849
Steel Magnolias (1989)     3.901734  3.365957 -0.535777

Reversing the sorted result and taking the first 5 rows gives the movies that male viewers prefer:

sort_by_diff[::-1][:5]
Out[43]: 
gender                                         F         M      diff
title                                                               
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
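
The sign convention is easy to mix up, so here is a small sketch on made-up numbers (toy and the titles X, Y, Z are hypothetical). With diff defined as M minus F, the most negative values sit at the top of an ascending sort, and the reversed view puts the most positive values first.

```python
import pandas as pd

# Made-up mean ratings by gender for three titles
toy = pd.DataFrame({'F': [3.8, 3.5, 2.7], 'M': [3.0, 4.2, 3.3]},
                   index=['X', 'Y', 'Z'])

# Positive diff: men rate the title higher; negative: women do
toy['diff'] = toy['M'] - toy['F']

ascending = toy.sort_values(by='diff')
preferred_by_women = ascending.index[0]       # most negative diff
preferred_by_men = ascending[::-1].index[0]   # most positive diff

print(ascending)
print(preferred_by_women, preferred_by_men)
```

When there are no ties, reversing with [::-1] should give the same order as calling sort_values(by='diff', ascending=False).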

To find the most divisive movies regardless of gender, compute the variance or standard deviation of the ratings:

# compute the standard deviation of ratings grouped by title
rating_std_by_title = data.groupby('title')['rating'].std()
# keep only the titles with at least 250 ratings
rating_std_by_title = rating_std_by_title.loc[active_titles]

rating_std_by_title.order(ascending=False)[:5]
-c:1: FutureWarning: order is deprecated, use sort_values(...) # a warning appears, but the result is still returned
rating_std_by_title.sort_values(ascending=False)[:5]
Out[50]: 
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Name: rating, dtype: float64
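
To make the idea of spread concrete, the sketch below uses two made-up titles (toy is hypothetical): one where every rating is the same and one where ratings range from 1 to 5. Note that the groupby std() method computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Made-up ratings: A is rated 3 by everyone, B splits opinion
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'rating': [3, 3, 3, 1, 3, 5],
})

# Sample standard deviation of the ratings for each title
std_by_title = toy.groupby('title')['rating'].std()

# The title with the largest spread comes first
most_divisive = std_by_title.sort_values(ascending=False).index[0]
print(std_by_title)
print(most_divisive)
```

Both titles have the same mean rating of 3, so the mean alone would not tell them apart; the standard deviation does.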
