电影案例分析
- 前言
- 我们想知道这些电影数据中评分的平均分,导演的人数等信息,我们应该怎么获取?
- 对于这一组电影数据,如果我们想Rating,Runtime (Minutes)的分布情况,应该如何呈现数据?
-
- 对于这一组电影数据,如果我们希望统计电影分类(genre)的情况,应该如何处理数据?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
movie = pd.read_csv("./data/IMDB-Movie-Data.csv")
movie.head()
前言
数据来源:kaggle
|
Rank |
Title |
Genre |
Description |
Director |
Actors |
Year |
Runtime (Minutes) |
Rating |
Votes |
Revenue (Millions) |
Metascore |
0 |
1 |
Guardians of the Galaxy |
Action,Adventure,Sci-Fi |
A group of intergalactic criminals are forced ... |
James Gunn |
Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... |
2014 |
121 |
8.1 |
757074 |
333.13 |
76.0 |
1 |
2 |
Prometheus |
Adventure,Mystery,Sci-Fi |
Following clues to the origin of mankind, a te... |
Ridley Scott |
Noomi Rapace, Logan Marshall-Green, Michael Fa... |
2012 |
124 |
7.0 |
485820 |
126.46 |
65.0 |
2 |
3 |
Split |
Horror,Thriller |
Three girls are kidnapped by a man with a diag... |
M. Night Shyamalan |
James McAvoy, Anya Taylor-Joy, Haley Lu Richar... |
2016 |
117 |
7.3 |
157606 |
138.12 |
62.0 |
3 |
4 |
Sing |
Animation,Comedy,Family |
In a city of humanoid animals, a hustling thea... |
Christophe Lourdelet |
Matthew McConaughey,Reese Witherspoon, Seth Ma... |
2016 |
108 |
7.2 |
60545 |
270.32 |
59.0 |
4 |
5 |
Suicide Squad |
Action,Adventure,Fantasy |
A secret government agency recruits some of th... |
David Ayer |
Will Smith, Jared Leto, Margot Robbie, Viola D... |
2016 |
123 |
6.2 |
393727 |
325.02 |
40.0 |
我们想知道这些电影数据中评分的平均分,导演的人数等信息,我们应该怎么获取?
movie["Rating"].mean()
6.723199999999999
np.unique(movie["Director"]).shape[0]
644
对于这一组电影数据,如果我们想Rating,Runtime (Minutes)的分布情况,应该如何呈现数据?
用pandas画图
movie["Rating"].plot(kind="hist")
用matplotlib画图
plt.figure(figsize=(20, 8), dpi=100)
plt.hist(movie["Rating"].values, bins=20)
max_ = movie["Rating"].max()
min_ = movie["Rating"].min()
t1 = np.linspace(min_, max_, num=21)
plt.xticks(t1)
plt.grid()
plt.show()
plt.figure(figsize=(20, 8), dpi=100)
plt.hist(movie["Runtime (Minutes)"].values, bins=20)
max_ = movie["Runtime (Minutes)"].max()
min_ = movie["Runtime (Minutes)"].min()
t1 = np.linspace(min_, max_, num=21)
plt.xticks(t1)
plt.grid()
plt.show()
对于这一组电影数据,如果我们希望统计电影分类(genre)的情况,应该如何处理数据?
temp_list = [i.split(",") for i in movie["Genre"]]
genre_list = np.unique([i for j in temp_list for i in j])
genre_list
array(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music',
'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller',
'War', 'Western'], dtype='
zeros = np.zeros([movie.shape[0], genre_list.shape[0]])
temp_movie = pd.DataFrame(zeros, columns=genre_list)
temp_movie.head()
|
Action |
Adventure |
Animation |
Biography |
Comedy |
Crime |
Drama |
Family |
Fantasy |
History |
Horror |
Music |
Musical |
Mystery |
Romance |
Sci-Fi |
Sport |
Thriller |
War |
Western |
0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
1 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
2 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
3 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
4 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
for i in range(1000):
temp_movie.loc[i, temp_list[i]] = 1
genre = temp_movie.sum().sort_values(ascending=False)
genre
Drama 513.0
Action 303.0
Comedy 279.0
Adventure 259.0
Thriller 195.0
Crime 150.0
Romance 141.0
Sci-Fi 120.0
Horror 119.0
Mystery 106.0
Fantasy 101.0
Biography 81.0
Family 51.0
Animation 49.0
History 29.0
Sport 18.0
Music 16.0
War 13.0
Western 7.0
Musical 5.0
dtype: float64
genre.plot(kind="bar", colormap="cool", figsize=(20, 8), fontsize=16)