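Note: this article is a learning log on data processing with Python. Its content mainly follows the book Python for Data Analysis by Wes McKinney (China Machine Press).
The required files were downloaded in Day2. The formats of the files used below are as follows:
Format of users.dat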
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
Format of ratings.dat
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
Format of movies.dat
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
Use pandas.read_table to load each table into its own pandas DataFrame object:
import pandas as pd
import os
path='E:\\Enthought\\book\\ch02\\movielens'
os.chdir(path) # change the current working directory to path
unames = ['user_id','gender','age','occupation','zip']
users = pd.read_table('users.dat',sep='::',header=None,names=unames) # split each record on '::'
-c:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
rnames = ['user_id','movie_id','rating','timestamp']
ratings = pd.read_table('ratings.dat',sep='::',header=None,names=rnames,engine='python') # passing engine='python' avoids the warning above
mnames = ['movie_id','title','genres']
movies = pd.read_table('movies.dat',sep='::',header=None,names=mnames,engine='python')
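The parsing step can be tried without the data files. The following is a minimal sketch, assuming a small in-memory sample in the same '::'-delimited layout as users.dat; the names sample and sample_users are made up for illustration, and pd.read_csv is used with an explicit sep, which is an assumption on my part rather than what the book does.

```python
import io
import pandas as pd

# Made-up in-memory sample laid out like users.dat (three records only).
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "3::M::25::15::55117\n"
)

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A separator longer than one character is interpreted as a regular
# expression, which the C parser does not support, so the Python
# engine is requested explicitly to avoid the ParserWarning.
sample_users = pd.read_csv(sample, sep='::', header=None,
                           names=unames, engine='python')
print(sample_users)
```

Because the file has no header row, header=None together with names is what attaches the column labels.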
Inspect each DataFrame object:
users[:5]
Out[11]:
user_id gender age occupation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455
ratings[:5]
Out[12]:
user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
movies[:5]
Out[13]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
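The age and occupation columns are given as codes; see the README for their meanings.
Next we try to analyze data that is spread across the three tables. Suppose we want to compute the mean rating of a movie by gender and age. This is much simpler once all the data is merged into a single table. We first use pandas's merge function to combine ratings with users, and then merge movies in as well. pandas infers which columns to use as the merge (or join) keys from the column names the tables have in common:
data = pd.merge(pd.merge(ratings,users),movies)
data[:5] # possibly because the merge strategy has changed, the next two outputs differ from the book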
Out[16]:
user_id movie_id rating timestamp gender age occupation zip \
0 1 1193 5 978300760 F 1 10 48067
1 2 1193 5 978298413 M 56 16 70072
2 12 1193 4 978220179 M 25 12 32793
3 15 1193 4 978199279 M 25 7 22903
4 17 1193 5 978158471 M 50 1 95350
title genres
0 One Flew Over the Cuckoo's Nest (1975) Drama
1 One Flew Over the Cuckoo's Nest (1975) Drama
2 One Flew Over the Cuckoo's Nest (1975) Drama
3 One Flew Over the Cuckoo's Nest (1975) Drama
4 One Flew Over the Cuckoo's Nest (1975) Drama
data.loc[0] # show the first record (.loc replaces .ix, which was removed in pandas 1.0)
Out[17]:
user_id 1
movie_id 1193
rating 5
timestamp 978300760
gender F
age 1
occupation 10
zip 48067
title One Flew Over the Cuckoo's Nest (1975)
genres Drama
Name: 0, dtype: object
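To see how the join keys are inferred, here is a small sketch on made-up tables (ratings_s, users_s and movies_s are hypothetical stand-ins for the real files). It compares the implicit merge used above with one that names the keys through the on argument.

```python
import pandas as pd

# Tiny made-up tables that mimic the three MovieLens tables.
ratings_s = pd.DataFrame({'user_id': [1, 2, 1],
                          'movie_id': [10, 10, 20],
                          'rating': [5, 4, 3]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20],
                         'title': ['A (1990)', 'B (1995)']})

# With no `on` argument, merge joins on the columns both frames share:
# user_id for the inner merge, movie_id for the outer one.
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# Naming the keys produces the same table and states the intent.
explicit = (ratings_s.merge(users_s, on='user_id')
                     .merge(movies_s, on='movie_id'))

print(implicit)
print(implicit.equals(explicit))
```

Naming the keys explicitly is a little more verbose, but it guards against a surprise join if two tables ever share an unrelated column name.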
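Now we can aggregate the ratings by any number of user or movie attributes. To compute the mean rating of each movie by gender, we can use the pivot_table method:
mean_ratings = data.pivot_table('rating',index='title',columns='gender',aggfunc='mean') # the parameters were renamed (rows -> index, cols -> columns), so this differs from the book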
mean_ratings[:5]
Out[26]:
gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
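The same pivot can be checked by hand on a few rows. This is a minimal sketch with made-up ratings (the frame toy and its values are invented), small enough that each cell of the result can be verified mentally.

```python
import pandas as pd

# Made-up ratings: movie A has two female ratings and one male rating,
# movie B has one of each.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, each cell the mean rating.
toy_means = toy.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')
print(toy_means)
```

For movie A the female cell is the mean of 4 and 2, and the male cell is the single rating 5.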
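This produces a DataFrame of mean ratings whose row labels are the movie titles and whose column labels are the genders. Next, we filter out the movies that received fewer than 250 ratings. We first group by title and then use size() to get a Series holding the size of each title's group: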
ratings_by_title = data.groupby('title').size()
ratings_by_title[:10]
Out[28]:
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
1-900 (1994) 2
10 Things I Hate About You (1999) 700
101 Dalmatians (1961) 565
101 Dalmatians (1996) 364
12 Angry Men (1957) 616
dtype: int64
active_titles = ratings_by_title.index[ratings_by_title>=250]
active_titles
Out[31]:
Index([u"'burbs, The (1989)", u'10 Things I Hate About You (1999)',
u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
u'2001: A Space Odyssey (1968)', u'2010 (1984)',
...
u'X-Men (2000)', u'Year of Living Dangerously (1982)',
u'Yellow Submarine (1968)', u"You've Got Mail (1998)",
u'Young Frankenstein (1974)', u'Young Guns (1988)',
u'Young Guns II (1990)', u'Young Sherlock Holmes (1985)',
u'Zero Effect (1998)', u'eXistenZ (1999)'],
dtype='object', name=u'title', length=1216)
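This index contains the titles of the movies with at least 250 ratings. We can then use it to select the rows we need from mean_ratings:
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings[:5] # differs from the book here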
Out[34]:
gender F M
title
'burbs, The (1989) 2.793478 2.962085
10 Things I Hate About You (1999) 3.646552 3.311966
101 Dalmatians (1961) 3.791444 3.500000
101 Dalmatians (1996) 3.240000 2.911215
12 Angry Men (1957) 4.184397 4.328421
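The count-then-filter pattern above can be sketched on a toy frame. The data is made up and the threshold min_ratings = 2 is a hypothetical stand-in for the 250 used on the real data.

```python
import pandas as pd

# Made-up ratings: A has 3 ratings, B has 1, C has 2.
toy = pd.DataFrame({'title':  ['A', 'A', 'A', 'B', 'C', 'C'],
                    'rating': [4, 2, 5, 3, 1, 5]})

# Number of ratings per title.
counts = toy.groupby('title').size()

# Keep the titles with at least `min_ratings` ratings
# (a stand-in for the threshold of 250 in the text).
min_ratings = 2
active = counts.index[counts >= min_ratings]

# Select only those titles from the per-title mean ratings.
toy_means = toy.groupby('title')['rating'].mean()
filtered = toy_means.loc[active]
print(filtered)
```

Title B is dropped because it has a single rating, while A and C remain.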
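To see which movies female viewers liked best, we can sort by the F column in descending order:
top_female_ratings = mean_ratings.sort_index(by='F',ascending=False)
-c:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
# A warning appears here: the documented parameters of sort_index in pandas 0.18.1 no longer include by (see the comparison below), so sort_values is used instead
top_female_ratings = mean_ratings.sort_values(by='F',ascending=False)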
top_female_ratings[:10]
Out[38]:
gender F M
title
Close Shave, A (1995) 4.644444 4.473795
Wrong Trousers, The (1993) 4.588235 4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589
Wallace & Gromit: The Best of Aardman Animation... 4.563107 4.385075
Schindler's List (1993) 4.562602 4.491415
Shawshank Redemption, The (1994) 4.539075 4.560625
Grand Day Out, A (1992) 4.537879 4.293255
To Kill a Mockingbird (1962) 4.536667 4.372611
Creature Comforts (1990) 4.513889 4.272277
Usual Suspects, The (1995) 4.513317 4.518248
Comparison of the two functions named in the warning, pandas version 0.18.1
pandas.DataFrame.sort_index()
Parameters:
axis : index, columns to direct sorting
level : int or level name or list of ints or list of level names
if not None, sort on values in specified index level(s)
ascending : boolean, default True
Sort ascending vs. descending
inplace : bool, if True, perform operation in-place
kind : {quicksort, mergesort, heapsort}
Choice of sorting algorithm. See also ndarray.np.sort for more information. mergesort is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.
na_position : {‘first’, ‘last’}
first puts NaNs at the beginning, last puts NaNs at the end
sort_remaining : bool
if true and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level
Returns:
sorted_obj : DataFrame
pandas.DataFrame.sort_values()
Parameters:
by : string name or list of names which refer to the axis items
axis : index, columns to direct sorting
ascending : bool or list of bool
Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
inplace : bool
if True, perform operation in-place
kind : {quicksort, mergesort, heapsort}
Choice of sorting algorithm. See also ndarray.np.sort for more information. mergesort is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.
na_position : {‘first’, ‘last’}
first puts NaNs at the beginning, last puts NaNs at the end
Returns:
sorted_obj : DataFrame
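The difference between the two functions is easiest to see side by side. Below is a minimal sketch on a made-up frame (the labels and numbers are invented): sort_values orders the rows by the contents of a column, while sort_index orders them by the row labels.

```python
import pandas as pd

# Made-up mean ratings with deliberately unsorted row labels.
frame = pd.DataFrame({'F': [3.5, 4.6, 2.8], 'M': [3.0, 4.4, 3.1]},
                     index=['b', 'c', 'a'])

# Sort by the values in column F, highest first.
by_value = frame.sort_values(by='F', ascending=False)

# Sort by the row labels, ignoring the column contents.
by_label = frame.sort_index()

print(by_value)
print(by_label)
```

Both calls return a new DataFrame and leave frame unchanged unless inplace=True is passed.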
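Measuring rating disagreement
Suppose we want to find the movies that divide male and female viewers the most. One approach is to add a column diff to mean_ratings that holds the difference between the two mean ratings, and then sort by it. Because diff is computed as M minus F, sorting in ascending order puts first the movies with the largest gap that female viewers preferred: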
mean_ratings['diff'] = mean_ratings['M']-mean_ratings['F']
sort_by_diff = mean_ratings.sort_values(by='diff')
sort_by_diff[:5]
Out[41]:
gender F M diff
title
Dirty Dancing (1987) 3.790378 2.959596 -0.830782
Jumpin' Jack Flash (1986) 3.254717 2.578358 -0.676359
Grease (1978) 3.975265 3.367041 -0.608224
Little Women (1994) 3.870588 3.321739 -0.548849
Steel Magnolias (1989) 3.901734 3.365957 -0.535777
Reversing the sorted result and taking the first 5 rows gives the movies that male viewers preferred instead:
sort_by_diff[::-1][:5]
Out[43]:
gender F M diff
title
Good, The Bad and The Ugly, The (1966) 3.494949 4.221300 0.726351
Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359
Dumb & Dumber (1994) 2.697987 3.336595 0.638608
Longest Day, The (1962) 3.411765 4.031447 0.619682
Cable Guy, The (1996) 2.250000 2.863787 0.613787
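The sign convention of diff is worth a quick check. In this sketch (made-up titles X, Y, Z and invented numbers) the most negative diff marks the movie female viewers preferred and the most positive marks the one male viewers preferred. The final line, which ranks by the absolute gap, is my own addition and not something the book does.

```python
import pandas as pd

# Made-up mean ratings for three hypothetical titles.
means = pd.DataFrame({'F': [3.8, 3.5, 4.0], 'M': [3.0, 4.2, 4.0]},
                     index=['X', 'Y', 'Z'])

# diff = M - F: negative when women rated higher, positive when men did.
means['diff'] = means['M'] - means['F']

# Ascending sort: most female-preferred first, most male-preferred last.
ordered = means.sort_values(by='diff')
female_pref = ordered.index[0]
male_pref = ordered.index[-1]

# Assumption beyond the text: largest gap in either direction.
largest_gap = means['diff'].abs().idxmax()

print(ordered)
print(female_pref, male_pref, largest_gap)
```

Reading the sorted frame from the bottom is equivalent to the [::-1] reversal used above.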
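If we only want the movies that caused the most disagreement among viewers, regardless of gender, we can compute the variance or standard deviation of the ratings:
# standard deviation of rating, grouped by title
rating_std_by_title = data.groupby('title')['rating'].std()
# keep only the movies with at least 250 ratings
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title.order(ascending=False)[:5]
-c:1: FutureWarning: order is deprecated, use sort_values(...) # despite the warning, the result is still produced
rating_std_by_title.sort_values(ascending=False)[:5]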
Out[50]:
title
Dumb & Dumber (1994) 1.321333
Blair Witch Project, The (1999) 1.316368
Natural Born Killers (1994) 1.307198
Tank Girl (1995) 1.277695
Rocky Horror Picture Show, The (1975) 1.260177
Name: rating, dtype: float64
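As a sanity check on what std() measures here, the following sketch uses made-up ratings for two hypothetical titles: one where viewers disagree strongly and one where every rating is the same. pandas computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Made-up ratings: A is polarizing (1, 3, 5), B is unanimous (3, 3, 3).
toy = pd.DataFrame({'title':  ['A', 'A', 'A', 'B', 'B', 'B'],
                    'rating': [1, 3, 5, 3, 3, 3]})

# Sample standard deviation of the ratings for each title.
spread = toy.groupby('title')['rating'].std()

# Titles with the widest spread of ratings first.
most_divisive = spread.sort_values(ascending=False)
print(most_divisive)
```

For A the squared deviations from the mean 3 sum to 8; dividing by n - 1 = 2 gives a variance of 4 and a standard deviation of 2, while B has no spread at all.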