MovieLens数据集:包含表示多个用户对多部电影的10万次评级数据,也包含电影元数据和用户属性信息。
unzip ml-100k.zip
cd ml-100k
其中重要的文件有u.user(用户属性文件)、u.item(电影元数据)和u.data(用户对电影的评级)
head -5 u.user
head -5 u.item
head -5 u.data
通过IPython交互式终端和matplotlib库来对数据进行处理和可视化。需要安装Anaconda,选择python 2.7。
在启动PySpark终端是,我们可以使用IPython而非标准的Python shell。启动时向IPython传入其他参数,让它在启动时也启用pylab功能。
IPYTHON=1 IPYTHON_OPTS="--pylab" ./bin/pyspark
user_data = sc.textFile("PATH/ml-100k/u.user")
user_data.first()
其输出如下:
统计用户、性别、职业和邮编的数目
user_fields = user_data.map(lambda line: line.split("|"))
num_users = user_fields.map(lambda fields: fields[0]).count()
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()
print "Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (num_users, num_genders, num_occupations, num_zipcodes)
输出如下:
接着用matplotlib的hist函数创建一个直方图,以分析用户年龄的分布情况:
ages = user_fields.map(lambda x: int(x[1])).collect()
hist(ages, bins=20, color='lightblue', normed=True) fig = matplotlib.pyplot.gcf() fig.set_size_inches(16, 10)
数据中对职业的描述用的是文本,所以需要对其稍作处理以便bar函数使用:
count_by_occupation = user_fields.map(lambda fields: (fields[3], 1)).reduceByKey(lambda x, y: x + y).collect()
x_axis1 = np.array([c[0] for c in count_by_occupation])
y_axis1 = np.array([c[1] for c in count_by_occupation])
x_axis = x_axis1[np.argsort(y_axis1)]
y_axis = y_axis1[np.argsort(y_axis1)]
pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)
plt.bar(pos, y_axis, width, color='lightblue')
plt.xticks(rotation=30)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
movie_data = sc.textFile("PATH/ml-100k/u.item")
print movie_data.first()
num_movies = movie_data.count()
print "Movies: %d" % num_movies
电影数据中有些数据不完整,故需要一个函数来处理解析release date时可能出现的解析错误,命名该函数为convert_year:
def convert_year(x):
try:
return int(x[-4:])
except:
return 1900
movie_fields = movie_data.map(lambda lines: lines.split("|"))
years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))
years_filtered = years.filter(lambda x: x != 1900)
movie_ages = years_filtered.map(lambda yr: 1998-yr).countByValue()
values = movie_ages.values()
bins = movie_ages.keys()
hist(values, bins=bins, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16,10)
rating_data_raw = sc.textFile("PATH/ml-100k/u.data")
print rating_data_raw.first()
num_ratings = rating_data_raw.count()
print "Ratings: %d" % num_ratings
评级次数共有10万。
rating_data = rating_data_raw.map(lambda line: line.split("\t"))
ratings = rating_data.map(lambda fields: int(fields[2]))
max_rating = ratings.reduce(lambda x, y: max(x, y))
min_rating = ratings.reduce(lambda x, y: min(x, y))
mean_rating = ratings.reduce(lambda x, y: x + y) / float(num_ratings)
median_rating = np.median(ratings.collect())
ratings_per_user = num_ratings / num_users
ratings_per_movie = num_ratings / num_movies
print "Min rating: %d" % min_rating
print "Max rating: %d" % max_rating
print "Average rating: %2.2f" % mean_rating
print "Median rating: %d" % median_rating
print "Average # of ratings per user: %2.2f" % ratings_per_user
print "Average # of ratings per movie: %2.2f" % ratings_per_movie
Spark对RDD也提供一个名为states的函数,该函数包含一个数值变量用于做类似的统计:
ratings.stats()
如下图所示:
count_by_rating = ratings.countByValue()
x_axis = np.array(count_by_rating.keys())
y_axis = np.array([float(c) for c in count_by_rating.values()])
# we normalize the y-axis here to percentages
y_axis_normed = y_axis / y_axis.sum()
pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)
plt.bar(pos, y_axis_normed, width, color='lightblue')
plt.xticks(rotation=30)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
user_ratings_grouped = rating_data.map(lambda fields: (int(fields[0]), int(fields[2]))).groupByKey()
user_ratings_byuser = user_ratings_grouped.map(lambda (k, v): (k, len(v)))
user_ratings_byuser.take(5)
user_ratings_byuser_local = user_ratings_byuser.map(lambda (k, v): v).collect()
hist(user_ratings_byuser_local, bins=200, color='lightblue', normed=True) fig = matplotlib.pyplot.gcf() fig.set_size_inches(16,10)
将类别特征表示为数字形式,常可借助k之1(1-of-k)方法进行。
k之1编码:假设变量可取的值有k个,如果对这些值用1到k编序,则可以用长度为k的二元向量来表示一个变量的取值,在这个向量里,该取值对应的序号所在的元素为1,其他元素都为0.