1. Environment
Python 2.7.13, with numpy, pandas, and the other packages imported below (including scikit-learn) installed.
import math
import random
from numpy import *
import numpy as np
import sys
import os
from pandas import Series,DataFrame
import pandas as pd
from sklearn.model_selection import train_test_split
2. Load the dataset
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames,engine = 'python')
data = ratings.filter(regex='user_id|movie_id|rating')
If the last argument (engine='python') is omitted, read_table emits the following warning:
ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
Filtering the timestamp column out of ratings gives data, which keeps only user_id, movie_id, and rating.
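To make the two steps above concrete (splitting on the two-character '::' separator and dropping the timestamp), here is a minimal sketch in plain Python rather than pandas. The helper name parse_ratings and the sample lines are illustrative assumptions; only the user::movie::rating::timestamp layout comes from the read_table call above.

```python
def parse_ratings(lines, sep='::'):
    """Parse 'user::movie::rating::timestamp' lines into
    (user_id, movie_id, rating) tuples, dropping the timestamp."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, movie_id, rating, _timestamp = line.split(sep)
        rows.append((int(user_id), int(movie_id), int(rating)))
    return rows


# Illustrative sample lines in the ratings.dat layout (hypothetical data).
sample = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
    "2::1193::4::978298413",
]
rows = parse_ratings(sample)
```

For the full file, the pandas call above is the practical choice; this sketch only shows what the separator and the column filter amount to.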
3. Inspect the data
h10 = ratings.head(10)
print h10
# or print data.head(10)
This prints the first 10 records.
Next, look at some summary statistics:
pd.options.display.float_format = '{:,.4f}'.format
print data.describe()
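describe() reports the count, mean, standard deviation, minimum, quartiles, and maximum of each column; since user_id and movie_id are identifiers, only the rating column's statistics are meaningful.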
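4. Split into training and test sets
# y is a placeholder index 1..1000209 (one entry per rating in ml-1m),
# passed so that train_test_split also returns matching train_y / test_y.
y = [i for i in range(1,1000210)]
train_X,test_X, train_y, test_y = train_test_split(data,y,test_size = 0.2,random_state=0)
20% of the records go to the test set and the remaining 80% to the training set; random_state=0 makes the split reproducible.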
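The idea behind a seeded 80/20 split can be sketched in plain Python as follows. This is not scikit-learn's implementation and will not select the same rows as train_test_split; split_rows and the made-up ratings are assumptions for illustration only.

```python
import random


def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle a copy of rows with a fixed seed, then use the last
    test_size fraction as the test set and the rest as the training set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same order every run
    n_test = int(round(len(shuffled) * test_size))
    n_train = len(shuffled) - n_test
    return shuffled[:n_train], shuffled[n_train:]


# 100 made-up (user_id, movie_id, rating) rows, purely for illustration.
rows = [(u, m, 3) for u in range(1, 11) for m in range(1, 11)]
train_rows, test_rows = split_rows(rows)
```

Because the shuffle is seeded, calling split_rows again with the same seed returns the same partition, which is the role random_state plays in the call above.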
5. Compute the average rating of each movie
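# Group the ratings by movie, then average the rating column within each group.
groupMovie = data.groupby('movie_id')
movieMean = groupMovie['rating'].agg('mean')
# movieMean is indexed by movie_id; this prints the mean rating of movie 1193.
print movieMean[1193]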
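What the groupby and mean aggregation compute can be sketched in plain Python as a running total and count per movie. The function name mean_rating_by_movie and the three sample rows are hypothetical; the real pipeline uses the pandas calls above.

```python
from collections import defaultdict


def mean_rating_by_movie(rows):
    """Return {movie_id: mean rating} for (user_id, movie_id, rating) rows."""
    totals = defaultdict(float)  # sum of ratings per movie
    counts = defaultdict(int)    # number of ratings per movie
    for _user_id, movie_id, rating in rows:
        totals[movie_id] += rating
        counts[movie_id] += 1
    return {movie_id: totals[movie_id] / counts[movie_id] for movie_id in totals}


# Made-up rows: two ratings for movie 1193, one for movie 661.
rows = [(1, 1193, 5), (2, 1193, 4), (1, 661, 3)]
movie_mean = mean_rating_by_movie(rows)
```

Looking up movie_mean[1193] here plays the same role as movieMean[1193] in the pandas version.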