数据集来源https://grouplens.org/datasets/movielens/1m/
代码区:
import pandas as pd
uname=['user_id','gender','age','occupation','zip']
users=pd.read_table(r'D:\demo1\ml-1m\users.dat',sep='::',header=None,names=uname,engine = 'python')
'''
sep : str, default ‘,’
指定分隔符。如果不指定参数,则会尝试使用逗号分隔。分隔符长于一个字符并且不是‘\s+’,
将使用python的语法分析器。并且忽略数据中的逗号。正则表达式例子:'\r\t'
header : int or list of ints, default ‘infer’指定行数用来作为列名,数据开始行数。
names : array-like, default None
用于结果的列名列表,如果数据文件中没有列标题行,就需要执行header=None。
engine解析器引擎使用。C引擎速度更快,而python引擎目前更加完善。除去警告
'''
rnames=['user_id','movie_id','rating','timestamp']
ratings=pd.read_table(r'D:\demo1\ml-1m\ratings.dat',sep='::',header=None,names=rnames,engine = 'python')
mname=['movie_id','title','genres']
movies=pd.read_table(r'D:\demo1\ml-1m\movies.dat',sep='::',header=None,names=mname,engine = 'python')
data=pd.merge(pd.merge(movies,ratings),users)
print data.loc[0]#ix[0]已经deprecated弃用
movie_id 1
title Toy Story (1995)
genres Animation|Children's|Comedy
user_id 1
rating 5
timestamp 978824268
gender F
age 1
occupation 10
zip 48067
'''
#枢轴表pandas.pivot_table(data, values=None,
index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
'''
mean_ratings=data.pivot_table('rating',index='title',columns='gender',aggfunc='mean')
print mean_ratings[:5]
result:
gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
#过滤数据不足200条的电影
ratings_groupby_title=data.groupby('title').size()
print ratings_groupby_title[:5]
reslut:
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
dtype: int64
active_titles=data.groupby('title').size().index[data.groupby('title').size()>=200]
print active_titles
result:
Index([u''burbs, The (1989)', u'10 Things I Hate About You (1999)',
u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
u'2001: A Space Odyssey (1968)', u'2010 (1984)',
...
u'Year of Living Dangerously (1982)', u'Yellow Submarine (1968)',
u'Yojimbo (1961)', u'You've Got Mail (1998)',
u'Young Frankenstein (1974)', u'Young Guns (1988)',
u'Young Guns II (1990)', u'Young Sherlock Holmes (1985)',
u'Zero Effect (1998)', u'eXistenZ (1999)'],
dtype='object', name=u'title', length=1426)
mean_ratings=mean_ratings.loc[active_titles]
#对F列进行降序
top_female_rating=mean_ratings.sort_values(by='F',ascending='False')
print top_female_rating[:10]
result:
gender F M
title
Battlefield Earth (2000) 1.574468 1.616949
Barb Wire (1996) 1.585366 2.100386
Showgirls (1995) 1.709091 2.166667
Jaws 3-D (1983) 1.863636 1.851064
Rocky V (1990) 1.878788 2.132780
Speed 2: Cruise Control (1997) 1.906667 1.863014
Avengers, The (1998) 1.915254 2.017467
Anaconda (1997) 2.000000 2.248447
Nightmare on Elm Street 5: The Dream Child, A (... 2.052632 1.981481
Howard the Duck (1986) 2.074627 2.103542
计算评分分歧
mean_ratings['diff']=mean_ratings['M']-mean_ratings['F']
sorted_by_diff=mean_ratings.sort_values(by='diff')
print sorted_by_diff[:5]
result:
gender F M
title
Dirty Dancing (1987) 3.790378 2.959596
To Wong Foo, Thanks for Everything! Julie Newma... 3.486842 2.795276
Jumpin' Jack Flash (1986) 3.254717 2.578358
Grease (1978) 3.975265 3.367041
Relic, The (1997) 3.309524 2.723077
gender diff
title
Dirty Dancing (1987) -0.830782
To Wong Foo, Thanks for Everything! Julie Newma... -0.691567
Jumpin' Jack Flash (1986) -0.676359
Grease (1978) -0.608224
Relic, The (1997) -0.586447
记一个笔记:脚本实现txt替换
#把文件内容替换
#把file3.txt 的 :: 替换为 ,,并保存到file4.txt
import re
fp3=open("file3.txt","r")
fp4=open("file4.txt","w")
for s in fp3.readlines():#先读出来
fp4.write(s.replace("::",",")) #替换 并写入
fp3.close()
fp4.close()