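In this project I analyze a movie dataset from TMDb (The Movie Database) and then communicate my findings. I use the Python libraries NumPy, Pandas, and Matplotlib to support the analysis.
The dataset contains information on about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.
Some columns, such as cast and genres, contain multiple values separated by pipe characters (|). The cast column contains some odd characters. The last two columns, whose names end in "_adj", give each film's budget and revenue adjusted for inflation (in 2010 US dollars).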
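To make the pipe-separated layout concrete, here is a minimal sketch of how such a column can be unpacked into one row per value. The two rows are made-up toy data, not taken from the real file, and `explode` is just one possible approach.

```python
import pandas as pd

# Toy data (hypothetical rows) with pipe-separated genres
toy = pd.DataFrame({
    "original_title": ["Film A", "Film B"],
    "genres": ["Action|Adventure", "Drama"],
})

# Split each string on the literal "|" and give every genre its own row;
# the other columns are repeated for each genre of the same film
long_form = (
    toy.assign(genres=toy["genres"].str.split("|"))
       .explode("genres")
       .reset_index(drop=True)
)
print(long_form)
```

After this step a film with several genres is counted once per genre, which matters when reading per-genre counts and means.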
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df_tmdb = pd.read_csv('tmdb-movies.csv')
| id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | ... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
3 rows × 21 columns
id 0
imdb_id 10
popularity 0
budget 0
revenue 0
original_title 0
cast 76
homepage 7929
director 44
tagline 2824
keywords 1493
overview 4
runtime 0
genres 23
production_companies 1030
release_date 0
vote_count 0
vote_average 0
release_year 0
budget_adj 0
revenue_adj 0
dtype: int64
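The listing above appears to be a per-column count of missing values. A minimal sketch of how such counts can be produced, on a made-up toy frame (the values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with a few missing entries
toy = pd.DataFrame({
    "homepage": ["http://example.com/", np.nan, np.nan],
    "genres": ["Drama", "Comedy", np.nan],
    "runtime": [90, 120, 100],
})

# isnull() marks missing cells as True; summing counts them per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```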
| id | popularity | budget | revenue | runtime | vote_count | vote_average | release_year | budget_adj | revenue_adj |
count | 10865.000000 | 10865.000000 | 1.086500e+04 | 1.086500e+04 | 10865.000000 | 10865.000000 | 10865.000000 | 10865.000000 | 1.086500e+04 | 1.086500e+04 |
mean | 66066.374413 | 0.646446 | 1.462429e+07 | 3.982690e+07 | 102.071790 | 217.399632 | 5.975012 | 2001.321859 | 1.754989e+07 | 5.136900e+07 |
std | 92134.091971 | 1.000231 | 3.091428e+07 | 1.170083e+08 | 31.382701 | 575.644627 | 0.935138 | 12.813260 | 3.430753e+07 | 1.446383e+08 |
min | 5.000000 | 0.000065 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 10.000000 | 1.500000 | 1960.000000 | 0.000000e+00 | 0.000000e+00 |
25% | 10596.000000 | 0.207575 | 0.000000e+00 | 0.000000e+00 | 90.000000 | 17.000000 | 5.400000 | 1995.000000 | 0.000000e+00 | 0.000000e+00 |
50% | 20662.000000 | 0.383831 | 0.000000e+00 | 0.000000e+00 | 99.000000 | 38.000000 | 6.000000 | 2006.000000 | 0.000000e+00 | 0.000000e+00 |
75% | 75612.000000 | 0.713857 | 1.500000e+07 | 2.400000e+07 | 111.000000 | 146.000000 | 6.600000 | 2011.000000 | 2.085325e+07 | 3.370173e+07 |
max | 417859.000000 | 32.985763 | 4.250000e+08 | 2.781506e+09 | 900.000000 | 9767.000000 | 9.200000 | 2015.000000 | 4.250000e+08 | 2.827124e+09 |
# Drop the rows whose genres value is missing (23 rows according to the counts above)
df_tmdb = df_tmdb.dropna(subset=['genres'])
(10842, 21)
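# Split the pipe-separated genres so that each genre of a film gets its own row (the film's other columns are repeated)
df_tmdb = df_tmdb.drop('genres', axis = 1).join(df_tmdb['genres'].str.split('|', expand = True).stack().reset_index(level = 1, drop = True).rename('genres'))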
| id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | genres |
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | Action |
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | Adventure |
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | Science Fiction |
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | Thriller |
1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 | Action |
5 rows × 21 columns
# Keep only rows with a positive inflation-adjusted budget and a positive popularity score
df_tmdb = df_tmdb[df_tmdb['budget_adj'] > 0]
df_tmdb = df_tmdb[df_tmdb['popularity'] > 0]
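The summary statistics show that the 25th and 50th percentiles of budget_adj are both 0, so this filter removes a large share of the rows. A minimal sketch, on made-up toy values, of how one might measure that share before filtering:

```python
import pandas as pd

# Toy frame (hypothetical values); 0 stands in for an unknown budget
toy = pd.DataFrame({
    "budget_adj": [0.0, 1.5e7, 0.0, 2.0e6],
    "popularity": [0.4, 1.2, 0.0, 0.8],
})

# Fraction of rows with a zero budget: the mean of a boolean mask
zero_share = (toy["budget_adj"] == 0).mean()

# Same filter as above, written as a single combined mask
kept = toy[(toy["budget_adj"] > 0) & (toy["popularity"] > 0)]
print(zero_share, len(kept))
```

Reporting this share alongside the results makes clear how much of the dataset the per-genre averages are actually based on.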
Drama 2316
Comedy 1740
Thriller 1641
Action 1428
Adventure 906
Romance 861
Crime 823
Horror 765
Science Fiction 701
Family 523
Fantasy 508
Mystery 440
Animation 260
History 183
Music 169
War 155
Western 74
Documentary 64
Foreign 35
TV Movie 9
Name: genres, dtype: int64
# Group by "genres", compute the mean budget_adj for each genre, and sort in descending order
budget_mean = df_tmdb.groupby('genres')['budget_adj'].mean().sort_values(ascending=False)
Adventure 7.133755e+07
Animation 6.800557e+07
Fantasy 6.749065e+07
Family 6.337153e+07
Action 5.502584e+07
Western 5.462267e+07
Science Fiction 5.176227e+07
War 5.039431e+07
History 4.847202e+07
Thriller 3.663947e+07
Mystery 3.586516e+07
Crime 3.542695e+07
Comedy 3.470445e+07
Music 3.135773e+07
Romance 3.113657e+07
Drama 3.052799e+07
Horror 1.661574e+07
Foreign 1.277944e+07
TV Movie 5.492844e+06
Documentary 5.063684e+06
Name: budget_adj, dtype: float64
budget_mean.plot(kind='bar', color='grey')
plt.title('Budget of different kinds of Genres');
popularity_mean = df_tmdb.groupby(['genres'])['popularity'].mean().sort_values(ascending=False)
most_popularity = popularity_mean[:5].index.tolist()
['Adventure', 'Science Fiction', 'Fantasy', 'Animation', 'Action']
Conclusion 2-1: The five most popular genres (by mean popularity) are, in order, Adventure, Science Fiction, Fantasy, Animation, and Action.
df_popularity = df_tmdb[['genres', 'release_year', 'popularity']].set_index('genres').loc[most_popularity].reset_index('genres')
df_popularity_year = pd.pivot_table(df_popularity, values='popularity', index='release_year', columns='genres')
df_popularity_year.plot(kind='line',subplots=True, sharex=True, sharey=True, figsize=(20,20));
# Scatter plots of vote_average against popularity, one panel per genre
df_vote = df_tmdb[['genres', 'vote_average', 'popularity']]
df_vote_scatter = sns.FacetGrid(df_vote, col='genres', col_wrap=4, hue='vote_average')
df_vote_scatter.map(plt.scatter, 'vote_average', 'popularity', alpha=.7)
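Conclusion 3-2: Setting aside Western, Documentary, TV Movie, and Foreign, which have few data points, films with low popularity appear across the whole range of average ratings, from low to high. As popularity increases, however, the corresponding average rating also tends to be higher.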
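The relationship read off the scatter plots can also be summarized numerically. The sketch below computes a Pearson correlation between vote_average and popularity for each genre. It is not part of the original analysis, the toy values are made up, and Pearson correlation is only one possible choice of measure.

```python
import pandas as pd

# Toy data (hypothetical values) in the same long format: one row per film and genre
toy = pd.DataFrame({
    "genres": ["Drama"] * 4 + ["Action"] * 4,
    "vote_average": [5.0, 6.0, 7.0, 8.0, 5.5, 6.0, 6.5, 7.5],
    "popularity": [0.2, 0.3, 0.9, 2.5, 0.3, 0.5, 1.0, 3.0],
})

# Pearson correlation between rating and popularity, computed per genre
corr_by_genre = {
    genre: group["vote_average"].corr(group["popularity"])
    for genre, group in toy.groupby("genres")
}
print(corr_by_genre)
```

A positive value is consistent with the pattern described above, but with skewed popularity values a rank-based measure may be more robust.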
2. From 1960 to 2015, the five most popular genres (by mean popularity) are Adventure, Science Fiction, Fantasy, Animation, and Action; of these, Adventure, Science Fiction, and Action show a clear rise in popularity after 2010.
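For three of these genres, the handling of abnormal "budget_adj" values, together with gaps in the underlying data, leaves the pre-1985 series discontinuous, so a direct comparison of the five most popular genres is only valid for changes after 1985.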
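One way to check where the per-genre series become continuous is to look for missing cells in the year-by-genre pivot table. The sketch below uses made-up toy values, and the variable names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values): Fantasy has no film in 1984
toy = pd.DataFrame({
    "release_year": [1980, 1980, 1984, 1985, 1985, 1986, 1986],
    "genres": ["Action", "Fantasy", "Action", "Action", "Fantasy", "Action", "Fantasy"],
    "popularity": [0.5, 0.4, 0.6, 0.7, 0.5, 0.9, 0.8],
})

# Mean popularity per year and genre; a missing combination shows up as NaN
by_year = pd.pivot_table(toy, values="popularity", index="release_year", columns="genres")

# Years in which at least one genre has no data
gap_years = by_year.index[by_year.isnull().any(axis=1)].tolist()

# First year after the last gap (or the first year overall if there are no gaps)
if gap_years:
    first_complete_year = by_year.index[by_year.index > max(gap_years)].min()
else:
    first_complete_year = by_year.index.min()
print(gap_years, first_complete_year)
```

Years in which none of the genres has any film do not appear in the pivot index at all, so they would need a separate check against the full range of years.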
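Every genre contains some films with very large "popularity" values, but without knowing how the score is actually computed it is impossible to tell whether these are true outliers, so they were left untouched. In addition, the conclusion above cannot be drawn for the four genres Western, Documentary, TV Movie, and Foreign, possibly because of their limited amount of data.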
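If one did want to flag unusually large popularity values for manual inspection, a common heuristic is the 1.5 × IQR rule. This was not applied in the analysis above. The sketch uses made-up toy values, and the threshold of 1.5 is a convention rather than something derived from how TMDb computes popularity.

```python
import pandas as pd

# Toy popularity values (hypothetical); the last one is much larger than the rest
popularity = pd.Series([0.2, 0.3, 0.4, 0.5, 0.6, 9.0])

# Interquartile range and the usual upper fence
q1, q3 = popularity.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Values above the fence are candidates for inspection, not automatic removals
flagged = popularity[popularity > upper_fence]
print(flagged)
```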