Kaggle TMDB5000 Movie Data Analysis and Movie Recommendation Model

Reference article explaining the functions used in the data analysis:

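The data come from the TMDB 5000 movie dataset on Kaggle. The analysis covers visualization of the movie data and a simple movie recommendation model, including:
1. The distribution of movie genres and how it changes over time
2. The relationships among profit, rating, and popularity
3. Which directors' movies sell well or are well received
4. The most prolific cast members
5. Movie keyword analysis
6. Similarity-based movie recommendation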

Data analysis

```code
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline
    plt.style.use('ggplot')
    import json
    import warnings
    warnings.filterwarnings('ignore')  # suppress warnings
[/code]

```code
    movie = pd.read_csv('tmdb_5000_movies.csv')
    credit = pd.read_csv('tmdb_5000_credits.csv')
[/code]

```code
    movie.head(1)
[/code]

 | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam… | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289… | [{"iso_3166_1": "US", "name": "United States o… | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso… | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800

```code
    movie.tail(3)
[/code]  
  
 | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
4800 | 0 | [{"id": 35, "name": "Comedy"}, {"id": 18, "nam… | http://www.hallmarkchannel.com/signedsealeddel… | 231617 | [{"id": 248, "name": "date"}, {"id": 699, "nam… | en | Signed, Sealed, Delivered | "Signed, Sealed, Delivered" introduces a dedic… | 1.444476 | [{"name": "Front Street Pictures", "id": 3958}… | [{"iso_3166_1": "US", "name": "United States o… | 2013-10-13 | 0 | 120.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | Signed, Sealed, Delivered | 7.0 | 6
4801 | 0 | [] | http://shanghaicalling.com/ | 126186 | [] | en | Shanghai Calling | When ambitious New York attorney Sam is sent t… | 0.857008 | [] | [{"iso_3166_1": "US", "name": "United States o… | 2012-05-03 | 0 | 98.0 | [{"iso_639_1": "en", "name": "English"}] | Released | A New Yorker in Shanghai | Shanghai Calling | 5.7 | 7
4802 | 0 | [{"id": 99, "name": "Documentary"}] | NaN | 25975 | [{"id": 1523, "name": "obsession"}, {"id": 224… | en | My Date with Drew | Ever since the second grade when he first saw … | 1.929883 | [{"name": "rusty bear entertainment", "id": 87… | [{"iso_3166_1": "US", "name": "United States o… | 2005-08-05 | 0 | 90.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | My Date with Drew | 6.3 | 16

```code
    movie.info()  # 4803 samples; some features have missing values
[/code]

```code
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4803 entries, 0 to 4802
    Data columns (total 20 columns):
    budget                  4803 non-null int64
    genres                  4803 non-null object
    homepage                1712 non-null object
    id                      4803 non-null int64
    keywords                4803 non-null object
    original_language       4803 non-null object
    original_title          4803 non-null object
    overview                4800 non-null object
    popularity              4803 non-null float64
    production_companies    4803 non-null object
    production_countries    4803 non-null object
    release_date            4802 non-null object
    revenue                 4803 non-null int64
    runtime                 4801 non-null float64
    spoken_languages        4803 non-null object
    status                  4803 non-null object
    tagline                 3959 non-null object
    title                   4803 non-null object
    vote_average            4803 non-null float64
    vote_count              4803 non-null int64
    dtypes: float64(3), int64(4), object(13)
    memory usage: 750.5+ KB

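[/code]

There are 4803 samples, and some features have missing values. homepage and tagline are missing in many rows, but these two do not affect the basic analysis; release_date and runtime can be filled in. On closer inspection, some samples have [] as the value of genres, keywords, or production_companies, which needs attention.

```code
    credit.info()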
[/code]

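##  Data cleaning

Many features are stored in JSON format, that is, as dictionary-like key-value pairs. To simplify later processing, we convert them into str or list form that is easy to handle in Python, which makes it easier to extract the useful information.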

```code
    # movie genres: parse the JSON text so that movies can be grouped by genre
    movie['genres']=movie['genres'].apply(json.loads)
    # Series.apply calls the given function on every element of the column
[/code]

```code
    movie['genres'].head(1)
[/code]

```code
    0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
    Name: genres, dtype: object
[/code]

```code
    list(zip(movie.index,movie['genres']))[:2]
[/code]

```code
    [(0,
      [{'id': 28, 'name': 'Action'},
       {'id': 12, 'name': 'Adventure'},
       {'id': 14, 'name': 'Fantasy'},
       {'id': 878, 'name': 'Science Fiction'}]),
     (1,
      [{'id': 12, 'name': 'Adventure'},
       {'id': 14, 'name': 'Fantasy'},
       {'id': 28, 'name': 'Action'}])]
[/code]

```code
    for index,i in zip(movie.index,movie['genres']):
        list1=[]
        for j in range(len(i)):
            list1.append((i[j]['name']))  # the 'name' value is the genre, e.g. Action
        movie.loc[index,'genres']=str(list1)
[/code]
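The extraction loop above is repeated for every JSON column. As a minimal sketch, it could be wrapped in one helper. `extract_names` is a hypothetical name that does not appear in the original notebook, and it assumes every entry is a dict with a `name` key.

```python
import json

def extract_names(json_text):
    """Return the 'name' value of every entry in a JSON-encoded list of dicts.

    Hypothetical helper (not in the original notebook); assumes each entry
    has a 'name' key, as the genres and keywords columns shown above do.
    """
    return [entry['name'] for entry in json.loads(json_text)]

# Illustrative input in the same shape as the raw genres column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(extract_names(sample))   # ['Action', 'Adventure']
print(extract_names('[]'))     # []
```

Applied to a raw column with `Series.apply`, a helper like this would keep real Python lists rather than their string form, which changes how the later string-cleaning steps would have to be written.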

```code
    movie.head(1)
    # genres is no longer JSON: the 'name' values (the genres) were extracted and assigned back to the column
[/code]

 | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 237000000 | ['Action', 'Adventure', 'Fantasy', 'Science Fi… | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289… | [{"iso_3166_1": "US", "name": "United States o… | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso… | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800

```code
    # Apply the same approach to the keywords column
    movie['keywords'] = movie['keywords'].apply(json.loads)
    for index,i in zip(movie.index,movie['keywords']):
        list2=[]
        for j in range(len(i)):
            list2.append(i[j]['name'])
        movie.loc[index,'keywords'] = str(list2)
[/code]

```code
    # Likewise for production_companies
    movie['production_companies'] = movie['production_companies'].apply(json.loads)
    for index,i in zip(movie.index,movie['production_companies']):
        list3=[]
        for j in range(len(i)):
            list3.append(i[j]['name'])
        movie.loc[index,'production_companies']=str(list3)
[/code]

```code
    movie['production_countries'] = movie['production_countries'].apply(json.loads)
    for index,i in zip(movie.index,movie['production_countries']):
        list3=[]
        for j in range(len(i)):
            list3.append(i[j]['name'])
        movie.loc[index,'production_countries']=str(list3)
[/code]

```code
    movie['spoken_languages'] = movie['spoken_languages'].apply(json.loads)
    for index,i in zip(movie.index,movie['spoken_languages']):
        list3=[]
        for j in range(len(i)):
            list3.append(i[j]['name'])
        movie.loc[index,'spoken_languages']=str(list3)
[/code]

```code
    movie.head(1)
[/code]  
  
 | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 237000000 | ['Action', 'Adventure', 'Fantasy', 'Science Fi… | http://www.avatarmovie.com/ | 19995 | ['culture clash', 'future', 'space war', 'spac… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | ['Ingenious Film Partners', 'Twentieth Century… | ['United States of America', 'United Kingdom'] | 2009-12-10 | 2787965087 | 162.0 | ['English', 'Español'] | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800

```code
    credit.head(1)
[/code]  
  
 | movie_id | title | cast | crew
---|---|---|---|---
0 | 19995 | Avatar | [{"cast_id": 242, "character": "Jake Sully", "… | [{"credit_id": "52fe48009251416c750aca23", "de…

```code
    credit['cast'] = credit['cast'].apply(json.loads)
    for index,i in zip(credit.index,credit['cast']):
        list3=[]
        for j in range(len(i)):
            list3.append(i[j]['name'])
        credit.loc[index,'cast']=str(list3)
[/code]

```code
    credit['crew'] = credit['crew'].apply(json.loads)
    # Extract the director from crew and store it as a director column for later analysis
    def director(x):
        for i in x:
            if i['job'] == 'Director':
                return i['name']
    credit['crew']=credit['crew'].apply(director)
    credit.rename(columns={'crew':'director'},inplace=True)
[/code]

```code
    credit.head(1)
[/code]  
  
 | movie_id | title | cast | director
---|---|---|---|---
0 | 19995 | Avatar | ['Sam Worthington', 'Zoe Saldana', 'Sigourney … | James Cameron
  
The id column in movie matches movie_id in credit, so the two tables can be merged to bring all the information into a single table.

```code
    fulldf = pd.merge(movie,credit,left_on='id',right_on='movie_id',how='left')
[/code]
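Before relying on a merge like the one above, it can help to check that the keys really match one to one. The sketch below uses toy frames (`movie_toy` and `credit_toy` are illustrative stand-ins, not the real tables) together with the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the movie and credit tables (values are illustrative only).
movie_toy = pd.DataFrame({'id': [1, 2, 3], 'title': ['A', 'B', 'C']})
credit_toy = pd.DataFrame({'movie_id': [1, 2], 'director': ['X', 'Y']})

merged = pd.merge(
    movie_toy, credit_toy,
    left_on='id', right_on='movie_id',
    how='left',
    validate='one_to_one',   # raise if either key column contains duplicates
    indicator=True,          # add a _merge column: 'both' or 'left_only'
)

# Rows of the left table that found no partner in the right table.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['id', 'title']])
```

With `how='left'` every movie row is kept, so unmatched movies show up with missing credit fields instead of disappearing.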

```code
    fulldf.head(1)
[/code]

 | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | … | spoken_languages | status | tagline | title_x | vote_average | vote_count | movie_id | title_y | cast | director
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 237000000 | ['Action', 'Adventure', 'Fantasy', 'Science Fi… | http://www.avatarmovie.com/ | 19995 | ['culture clash', 'future', 'space war', 'spac… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | ['Ingenious Film Partners', 'Twentieth Century… | … | ['English', 'Español'] | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 | 19995 | Avatar | ['Sam Worthington', 'Zoe Saldana', 'Sigourney … | James Cameron
  
1 rows × 24 columns

```code
    fulldf.shape
[/code]

(4803, 24)

```code
    # Both tables have a title column; after the merge they are automatically named title_x and title_y
    fulldf.rename(columns={'title_x':'title'},inplace=True)
    fulldf.drop('title_y',axis=1,inplace=True)
[/code]

```code
    # Missing values
    NAs = pd.DataFrame(fulldf.isnull().sum())
    NAs[NAs.sum(axis=1)>0].sort_values(by=[0],ascending=False)
[/code]

 | 0
---|---
homepage | 3091
tagline | 844
director | 30
overview | 3
runtime | 2
release_date | 1

```code
    # Fill in release_date: first find the movie whose date is missing
    fulldf.loc[fulldf['release_date'].isnull(),'title']
[/code]  
  
4553 America Is Still the Place  
Name: title, dtype: object

```code
    # Filled in with a date looked up online
    fulldf['release_date']=fulldf['release_date'].fillna('2014-06-01')
[/code]

```code
    # runtime is the movie length; fill missing values with the mean
    fulldf['runtime'] = fulldf['runtime'].fillna(fulldf['runtime'].mean())
[/code]

```code
    # For easier analysis, convert release_date (object) to datetime and extract the year and month
    fulldf['release_year'] = pd.to_datetime(fulldf['release_date'],format='%Y-%m-%d').dt.year
    fulldf['release_month'] = pd.to_datetime(fulldf['release_date'],format='%Y-%m-%d').dt.month
[/code]
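The conversion above parses the same column twice and raises an error on any malformed date. A small sketch of an alternative, using made-up values, parses once and uses `errors='coerce'` so that bad values become NaT and can be counted.

```python
import pandas as pd

# Illustrative values; the last one is deliberately malformed.
dates = pd.Series(['2009-12-10', '2007-05-19', 'not a date'])

# errors='coerce' turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')
years = parsed.dt.year
months = parsed.dt.month

print(parsed.isna().sum())   # number of values that failed to parse
```

When any value is NaT, the extracted year and month come back as floats with NaN, so they may need filling or casting before being used as column labels.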

##  Data exploration

```code
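    # Movie genres
    # Each value is the string form of a list, so some str processing is needed: first strip the surrounding brackets,
    # then remove the spaces between adjacent genres (this also turns 'Science Fiction' into 'ScienceFiction'),
    # then remove the single quotes; after that the genres can be split on commas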
    fulldf['genres']=fulldf['genres'].str.strip('[]').str.replace(" ","").str.replace("'","")
[/code]

```code
    # Genres are now separated by commas
    fulldf['genres']=fulldf['genres'].str.split(',')
[/code]

```code
    list1=[]
    for i in fulldf['genres']:
        list1.extend(i)
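    # Count each genre and keep the 10 most frequent (value_counts already sorts in descending order)
    gen_list=pd.Series(list1).value_counts()[:10]
    # to_frame(name=...) sets the column name directly; renaming column 0 has no effect on pandas versions that name the counts 'count'
    gen_df = gen_list.to_frame(name='Total')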
[/code]
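The strip/replace chain above removes every space, which is why 'Science Fiction' later appears as 'ScienceFiction'. As a hedged alternative sketch (not what the original notebook does), the string form of each list can be parsed back with `ast.literal_eval` and counted with `collections.Counter`. The `rows` values below are made up for illustration.

```python
import ast
from collections import Counter

# String forms of lists, in the shape the genres column has after cleaning.
rows = ["['Action', 'Science Fiction']", "['Drama']", "[]", "['Action', 'Drama']"]

counts = Counter()
for text in rows:
    # literal_eval rebuilds the real Python list from its string form.
    counts.update(ast.literal_eval(text))

print(counts.most_common(2))
```

This keeps multi-word genres intact, and an empty list contributes nothing instead of an empty-string genre. Code further down that refers to 'ScienceFiction' or to the empty genre would need adjusting if this route were taken.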

```code
    fulldf.loc[4801]  # .ix has been removed from pandas; .loc selects by label
[/code]

```code
      budget                                                                  0
    genres                                                                 []
    homepage                                      http://shanghaicalling.com/
    id                                                                 126186
    keywords                                                               []
    original_language                                                      en
    original_title                                           Shanghai Calling
    overview                When ambitious New York attorney Sam is sent t...
    popularity                                                       0.857008
    production_companies                                                   []
    production_countries                ['United States of America', 'China']
    release_date                                                   2012-05-03
    revenue                                                                 0
    runtime                                                                98
    spoken_languages                                              ['English']
    status                                                           Released
    tagline                                          A New Yorker in Shanghai
    title                                                    Shanghai Calling
    vote_average                                                          5.7
    vote_count                                                              7
    movie_id                                                           126186
    cast                    ['Daniel Henney', 'Eliza Coupe', 'Bill Paxton'...
    director                                                      Daniel Hsia
    release_year                                                         2012
    release_month                                                           5
    Name: 4801, dtype: object
[/code]

```code
    plt.subplots(figsize=(10,8))
    sns.barplot(y=gen_df.index,x='Total',data=gen_df,palette='GnBu_d')
    plt.xticks(fontsize=15)  # set the tick label font size
    plt.yticks(fontsize=15)
    plt.xlabel('Total',fontsize=15)
    plt.ylabel('Genres',fontsize=15)
    plt.title('Top 10 Genres',fontsize=20)
    plt.show()
[/code]

![png](https://img-blog.csdn.net/20180523132516220?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

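The ten most common genres include drama, comedy, thriller, and action, which are also the genres most often seen in cinemas today. What lies behind the large number of movies in these genres?  
Let us look at how the number of movies relates to time.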

```code
    # Deduplicate the genres
    l=[]
    for i in list1:
        if i not in l:
            l.append(i)
    #l.remove("")  # some movies have an empty genre
    len(l)  # l is the deduplicated list of genres
[/code]

21

```code
    year_min = fulldf['release_year'].min()
    year_max = fulldf['release_year'].max()
    
    # Genres as index, years as columns; each cell holds the number of movies of that genre released that year, starting at 0
    year_genr =pd.DataFrame(0,index=l,columns=range(year_min,year_max+1))
    
    
    intil_y = np.array(fulldf['release_year'])  # release year of each movie, in row order
    z = 0
    for i in fulldf['genres']:
        splt_gen = list(i)  # all genres of one movie
        for j in splt_gen:
            year_genr.loc[j,intil_y[z]] = year_genr.loc[j,intil_y[z]]+1  # increment the count of this genre in that year
        z+=1
    year_genr = year_genr.sort_values(by=2006,ascending=False)
    year_genr = year_genr.iloc[0:10,-49:-1]
    year_genr
[/code]

 | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | … | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Drama | 7 | 8 | 3 | 6 | 4 | 3 | 2 | 4 | 8 | 5 | … | 97 | 106 | 122 | 115 | 99 | 79 | 110 | 110 | 95 | 37
Comedy | 3 | 4 | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 1 | … | 67 | 82 | 97 | 87 | 82 | 80 | 71 | 62 | 52 | 26
Thriller | 3 | 2 | 3 | 1 | 2 | 1 | 1 | 2 | 4 | 5 | … | 53 | 55 | 59 | 56 | 69 | 58 | 53 | 66 | 67 | 27
Action | 4 | 4 | 4 | 1 | 2 | 1 | 1 | 2 | 6 | 5 | … | 44 | 46 | 51 | 49 | 58 | 43 | 56 | 54 | 46 | 39
Romance | 2 | 1 | 3 | 0 | 1 | 2 | 1 | 2 | 2 | 3 | … | 37 | 38 | 57 | 45 | 30 | 39 | 25 | 24 | 23 | 9
Family | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | … | 20 | 29 | 28 | 29 | 28 | 17 | 22 | 23 | 17 | 9
Crime | 3 | 0 | 2 | 3 | 2 | 2 | 0 | 2 | 0 | 0 | … | 28 | 33 | 32 | 30 | 24 | 27 | 37 | 27 | 26 | 10
Adventure | 2 | 3 | 1 | 2 | 1 | 2 | 2 | 2 | 5 | 4 | … | 25 | 37 | 36 | 30 | 32 | 25 | 36 | 37 | 35 | 23
Fantasy | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 2 | … | 19 | 20 | 22 | 21 | 15 | 19 | 21 | 16 | 10 | 13
Horror | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 4 | … | 27 | 21 | 30 | 27 | 24 | 33 | 25 | 21 | 33 | 20
  
10 rows × 48 columns
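The nested loop above fills the genre-by-year table one cell at a time. A possible shorter route, assuming the genres column holds real lists, is `DataFrame.explode` followed by `pd.crosstab`. The toy frame and the names `pairs` and `year_genre_counts` are illustrative and do not come from the notebook.

```python
import pandas as pd

# Toy data: one row per movie, with its release year and list of genres.
toy = pd.DataFrame({
    'release_year': [2009, 2009, 2010],
    'genres': [['Action', 'Drama'], ['Drama'], ['Action']],
})

# explode gives one row per (movie, genre) pair; crosstab then counts the pairs.
pairs = toy.explode('genres')
year_genre_counts = pd.crosstab(pairs['genres'], pairs['release_year'])
print(year_genre_counts)
```

Unlike the table built by the loop, years without any movie get no column here, and movies with an empty genre list drop out instead of being counted under an empty-string genre.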

```code
    plt.subplots(figsize=(10,8))
    plt.plot(year_genr.T)
    plt.title('Genres vs Time',fontsize=20)
    plt.xticks(range(1969,2020,5))
    plt.legend(year_genr.T)
    plt.show()
[/code]

![png](https://img-blog.csdn.net/20180523132553536?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

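From around 1994 the film industry entered a boom: movies of every genre increased substantially, with drama, comedy, thriller, and action increasing the most. The large number of movies in these genres is therefore related, at least in part, to the overall growth of film as an art form.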

```code
    # For easier analysis, build a new dataframe with a subset of features and examine how they relate to genre
    partdf = fulldf[['title','vote_average','vote_count','release_year','popularity','budget','revenue']].reset_index(drop=True)
[/code]

```code
    partdf.head(2)
[/code]

 | title | vote_average | vote_count | release_year | popularity | budget | revenue
---|---|---|---|---|---|---|---
0 | Avatar | 7.2 | 11800 | 2009 | 150.437577 | 237000000 | 2787965087
1 | Pirates of the Caribbean: At World's End | 6.9 | 4500 | 2007 | 139.082615 | 300000000 | 961000000
  
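Because a movie can belong to several genres, we add each genre as a column; for every movie, the column is set to 1 if the movie belongs to that genre and 0 otherwise.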

```code
    for per in l:
        partdf[per]=0
    
        z=0
        for gen in fulldf['genres']:
    
            if per in list(gen):
                partdf.loc[z,per] = 1
            else:
                partdf.loc[z,per] = 0
            z+=1
    partdf.head(2)
[/code]

 | title | vote_average | vote_count | release_year | popularity | budget | revenue | Action | Adventure | Fantasy | … | Romance | Horror | Mystery | History | War | Music | Documentary | Foreign | TVMovie | 
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Avatar | 7.2 | 11800 | 2009 | 150.437577 | 237000000 | 2787965087 | 1 | 1 | 1 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | Pirates of the Caribbean: At World's End | 6.9 | 4500 | 2007 | 139.082615 | 300000000 | 961000000 | 1 | 1 | 1 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
  
2 rows × 28 columns
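The 0/1 genre columns above are filled cell by cell. As a sketch under the assumption that each row holds a list of genres, the same kind of encoding can be produced with `Series.str.join` and `Series.str.get_dummies`. The toy frame and the names `flags` and `encoded` are illustrative.

```python
import pandas as pd

# Toy data: the third movie has no genre at all.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': [['Action', 'Drama'], ['Drama'], []],
})

# Join each list into a string such as 'Action|Drama',
# then let get_dummies build one 0/1 column per distinct genre.
flags = toy['genres'].str.join('|').str.get_dummies(sep='|')
encoded = pd.concat([toy[['title']], flags], axis=1)
print(encoded)
```

A movie with no genre simply gets 0 in every column, so no empty-string column is created.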

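Next we want the mean of several features for each genre. We create a new dataframe with one row per genre and one column per averaged feature, such as the average rating (vote_average), revenue, and popularity.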

```code
    mean_gen = pd.DataFrame(l)
[/code]

```code
    # Mean rating per genre
    newArray = []
    for genre in l:
        newArray.append(partdf.groupby(genre, as_index=True)['vote_average'].mean())
    # Each entry of newArray is a Series indexed by the flag value: [0] is the mean over movies outside the genre, [1] the mean over movies in it. We only need [1].
    newArray2 = []
    for i in range(len(l)):
        newArray2.append(newArray[i][1])
    
    mean_gen['mean_votes_average']=newArray2
    mean_gen.head(2)
[/code]

 | 0 | mean_votes_average
---|---|---
0 | Action | 5.989515
1 | Adventure | 6.156962

```code
    # The same approach applied to the other features
    # budget
    newArray = []*len(l)
    for genre in l:
        newArray.append(partdf.groupby(genre, as_index=True)['budget'].mean())
    newArray2 = []*len(l)
    for i in range(len(l)):
        newArray2.append(newArray[i][1])
    
    mean_gen['mean_budget']=newArray2
[/code]

```code
    # revenue
    newArray = []*len(l)
    for genre in l:
        newArray.append(partdf.groupby(genre, as_index=True)['revenue'].mean())
    newArray2 = []*len(l)
    for i in range(len(l)):
        newArray2.append(newArray[i][1])
    
    mean_gen['mean_revenue']=newArray2
[/code]

```code
    # popularity: TMDB popularity score, driven in part by how often the movie's page is viewed
    newArray = []*len(l)
    for genre in l:
        newArray.append(partdf.groupby(genre, as_index=True)['popularity'].mean())
    newArray2 = []*len(l)
    for i in range(len(l)):
        newArray2.append(newArray[i][1])
    
    mean_gen['mean_popular']=newArray2
[/code]

```code
    # count() counts rows here, so this column holds the number of movies in each genre rather than the number of votes
    newArray = []*len(l)
    for genre in l:
        newArray.append(partdf.groupby(genre, as_index=True)['vote_count'].count())
    newArray2 = []*len(l)
    for i in range(len(l)):
        newArray2.append(newArray[i][1])
    
    mean_gen['vote_count']=newArray2
[/code]
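The blocks above repeat the same groupby pattern once per feature. A compact sketch of the same idea, on a toy frame with two genre flag columns, averages all features for the movies flagged 1 in each genre in a single pass. The names `features`, `genres`, `genre_means`, and `n_movies` are illustrative.

```python
import pandas as pd

# Toy data: two numeric features and two 0/1 genre flag columns.
toy = pd.DataFrame({
    'vote_average': [7.0, 6.0, 8.0],
    'budget': [100, 50, 30],
    'Action': [1, 1, 0],
    'Drama': [0, 1, 1],
})
features = ['vote_average', 'budget']
genres = ['Action', 'Drama']

# For each genre, average the features over the movies flagged 1 for it.
rows = {g: toy.loc[toy[g] == 1, features].mean() for g in genres}
genre_means = pd.DataFrame(rows).T

# Number of movies per genre (the sum of the 0/1 flags).
genre_means['n_movies'] = [int(toy[g].sum()) for g in genres]
print(genre_means)
```

Because a movie can carry several genres, one movie contributes to the mean of every genre it belongs to, just as in the groupby version.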

```code
    mean_gen.rename(columns={0:'genre'},inplace=True)
    mean_gen.replace('','none',inplace=True)
    # 'none' marks movies whose genre is missing; there are very few of them, so we drop that row
    mean_gen.drop(20,inplace=True)
[/code]

```code
    mean_gen['vote_count'].describe()
[/code]  
  
count 20.000000  
mean 608.000000  
std 606.931974  
min 8.000000  
25% 174.750000  
50% 468.500000  
75% 816.000000  
max 2297.000000  
Name: vote_count, dtype: float64

```code
    mean_gen['mean_votes_average'].describe()
[/code]

count 20.000000  
mean 6.173921  
std 0.278476  
min 5.626590  
25% 6.009644  
50% 6.180978  
75% 6.344325  
max 6.719797  
Name: mean_votes_average, dtype: float64

```code
    # Mean rating (left axis) and mean popularity (right axis) per genre.
    # sns.factorplot was renamed catplot and later removed; it is figure-level, so the axes-level pointplot is used to draw on ax1 and ax2.
    f,ax1 = plt.subplots(figsize=(10,6))
    ax2 = ax1.twinx()
    sns.pointplot(x='genre', y='mean_votes_average',data=mean_gen,ax=ax1)
    ax1.set_ylabel('votes_average')
    ax1.set_ylim((4,7))
    
    sns.pointplot(x='genre',y='mean_popular',data=mean_gen,ax=ax2,color='blue')
    ax2.set_ylabel('popularity')
    ax2.set_ylim((0,40))
    ax1.set_xticklabels(mean_gen['genre'],rotation=90)
    
    plt.show()
[/code]

![png](https://img-blog.csdn.net/20180523132655277?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

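The figure shows that foreign films are not popular; their ratings are not low, but that is partly because so few people rated them. Animation, Science Fiction, Fantasy, and Action are popular and also reasonably well rated. Drama, the most numerous genre, is rated highly but is less popular, possibly because most dramas are not commercial films.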

```code
    mean_gen['profit'] = mean_gen['mean_revenue']-mean_gen['mean_budget']
[/code]

```code
    s = mean_gen['profit'].sort_values(ascending=False)[:10]
    pdf = mean_gen.loc[s.index]  # .ix has been removed from pandas; .loc selects by label
    
    plt.subplots(figsize=(10,6))
    sns.barplot(x='profit',y='genre',data=pdf,palette='BuGn_r')
    plt.xticks(fontsize=15)  # set the tick label font size
    plt.yticks(fontsize=15)
    plt.xlabel('Profit',fontsize=15)
    plt.ylabel('Genres',fontsize=15)
    plt.title('Top 10 Profit of Genres',fontsize=20)
    
    plt.show()
[/code]

![png](https://img-blog.csdn.net/20180523132747502?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

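Animation, adventure, family, and science fiction are the most profitable genres; they suit the cinema experience and are also popular. Next we look at how the variables relate to one another.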

```code
    cordf = partdf.drop(l,axis=1)
    cordf.columns  # the features we want to examine
[/code]

```code
     Index(['title', 'vote_average', 'vote_count', 'release_year', 'popularity',
           'budget', 'revenue'],
          dtype='object')
[/code]

```code
    corrmat = cordf.drop('title',axis=1).corr()  # title is text, so exclude it before computing correlations
    f, ax = plt.subplots(figsize=(10,7))
    sns.heatmap(corrmat,cbar=True, annot=True,vmax=.8, cmap='PuBu',square=True)
[/code]

![png](https://img-blog.csdn.net/20180523132843861?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

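The heatmap shows a fairly strong relationship between vote count and popularity, which suggests that the more people watch a movie, the more people rate it. Budget and revenue are also fairly strongly related, and revenue is fairly strongly related to popularity and vote count, so promoting a movie well matters. Let us look more closely.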
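Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr`. As a small worked sketch with made-up numbers, the coefficient can be computed by hand and compared with `np.corrcoef`. The function name `pearson` is illustrative.

```python
import numpy as np

# Toy values: revenue is exactly 2 * budget + 5, a perfect linear relation.
budget = np.array([10.0, 20.0, 30.0, 40.0])
revenue = np.array([25.0, 45.0, 65.0, 85.0])

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

r = pearson(budget, revenue)
print(round(r, 6))
```

A value near 1 or -1 indicates a strong linear relation and a value near 0 a weak one; the coefficient says nothing about which variable drives the other.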

```code
    # budget and revenue contain zeros; remove these dirty rows, along with movies that have very few votes
    partdf = partdf[partdf['budget']>0]
    partdf = partdf[partdf['revenue']>0]
    partdf = partdf[partdf['vote_count']>3]
    plt.subplots(figsize=(6,5))
    
    plt.xlabel('Budget',fontsize=15)
    plt.ylabel('Revenue',fontsize=15)
    plt.title('Budget vs Revenue',fontsize=20)
    sns.regplot(x='budget',y='revenue',data=partdf,ci=None)
[/code]

![png](https://img-blog.csdn.net/20180523132916443?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)
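`sns.regplot` draws a scatter plot together with a fitted linear regression line. To see the numbers behind such a line, a least-squares fit can be sketched with `np.polyfit`; the data below are toy values, not the movie data.

```python
import numpy as np

# Toy values: revenue is exactly 2 * budget + 1.
budget = np.array([1.0, 2.0, 3.0, 4.0])
revenue = np.array([3.0, 5.0, 7.0, 9.0])

# deg=1 fits a straight line; the coefficients come back highest power first.
slope, intercept = np.polyfit(budget, revenue, deg=1)
print(slope, intercept)
```

On the real data the slope would estimate how much revenue changes per unit of budget, averaged over all movies that survive the filtering step.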

```code
    plt.subplots(figsize=(6,5))
    plt.xlabel('vote_average',fontsize=15)
    plt.ylabel('popularity',fontsize=15)
    plt.title('Score vs Popular',fontsize=20)
    sns.regplot(x='vote_average',y='popularity',data=partdf)
[/code]

![png](https://img-blog.csdn.net/20180523132946451?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

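Budget and revenue, and rating and popularity, are roughly linearly related. For low-budget movies, however, budget has little effect on revenue, and highly rated movies are for the most part also popular. Next, let us see which movies earn the most, are the most popular, and are the best reviewed.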

```code
    print(partdf.loc[partdf['revenue']==partdf['revenue'].max()]['title'])
    print(partdf.loc[partdf['popularity']==partdf['popularity'].max()]['title'])
    print(partdf.loc[partdf['vote_average']==partdf['vote_average'].max()]['title'])
[/code]

0 Avatar  
Name: title, dtype: object  
546 Minions  
Name: title, dtype: object  
1881 The Shawshank Redemption  
Name: title, dtype: object
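The lookups above compare every row with the column maximum. An equivalent sketch for the single top row uses `Series.idxmax`, shown here on a toy frame with made-up titles and values.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'revenue': [300, 900, 500],
    'popularity': [12.5, 3.1, 40.2],
})

# idxmax returns the index label of the first row holding the maximum.
top_revenue = toy.loc[toy['revenue'].idxmax(), 'title']
top_popularity = toy.loc[toy['popularity'].idxmax(), 'title']
print(top_revenue, top_popularity)
```

When several movies tie for the maximum, `idxmax` reports only the first, whereas the boolean comparison used above returns all of them.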

```code
    partdf['profit'] = partdf['revenue']-partdf['budget']
    print(partdf.loc[partdf['profit']==partdf['profit'].max()]['title'])
[/code]

0 Avatar  
Name: title, dtype: object

Minions is the most popular, Avatar earns the most, and The Shawshank Redemption has the best rating.

```code
    s1 = cordf.groupby(by='release_year').budget.sum()
    s2 = cordf.groupby(by='release_year').revenue.sum()
    sdf = pd.concat([s1,s2],axis=1)
    sdf = sdf.iloc[-39:-2]
    plt.plot(sdf)
    plt.xticks(range(1979,2020,5))
    plt.legend(sdf)
    plt.show()
[/code]

![png](https://img-blog.csdn.net/20180523133047409?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The film industry is clearly booming, which helps explain why big-budget productions are becoming more and more common.
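The yearly totals plotted above come from two separate groupby calls that are then concatenated. A short sketch on a toy frame shows that selecting both columns in one groupby gives the same kind of table directly.

```python
import pandas as pd

toy = pd.DataFrame({
    'release_year': [2009, 2009, 2010],
    'budget': [100, 50, 70],
    'revenue': [300, 80, 150],
})

# One groupby, two summed columns; the result is indexed by release_year.
yearly = toy.groupby('release_year')[['budget', 'revenue']].sum()
print(yearly)
```

The result can be sliced and plotted in the same way as the concatenated frame.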

For science fiction fans, here are the most popular science fiction movies:

```code
    # Most popular science fiction movies
    s = partdf.loc[partdf['ScienceFiction']==1,'popularity'].sort_values(ascending=False)[:10]
    sdf = partdf.loc[s.index]  # .ix has been removed from pandas; .loc selects by label
    sns.barplot(x='popularity',y='title',data=sdf)
    plt.show()
[/code]

![png](https://img-blog.csdn.net/20180523133107982?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

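Interstellar is the most popular, closely followed by Guardians of the Galaxy; other genres can be explored in the same way. Now let us look at how filmmakers influence the market. A good movie depends on the contributions of the people in front of and behind the camera, and here we focus on directors and actors.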

```code
    # Directors with the highest average revenue
    rev_d = fulldf.groupby('director')['revenue'].mean()
    top_rev_d = rev_d.sort_values(ascending=False).head(20)
    top_rev_d = pd.DataFrame(top_rev_d)
    plt.subplots(figsize=(10,6))
    sns.barplot(x='revenue',y=top_rev_d.index,data=top_rev_d,palette='BuGn_r')
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.xlabel('Average Revenue',fontsize=15)
    plt.ylabel('Director',fontsize=15)
    plt.title('Top 20 Revenue by Director',fontsize=20)
    plt.show()
[/code]

![png](https://img-blog.csdn.net/20180523133135909?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

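The chart shows the directors who do best at the box office. Which directors have made the most movies, and which are both well reviewed and commercially successful?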

```code
    list2 = fulldf[fulldf['director']!=''].director.value_counts()[:10].sort_values(ascending=True)
    list2 = pd.Series(list2)
    list2
[/code]

Oliver Stone 14  
Renny Harlin 15  
Steven Soderbergh 15  
Robert Rodriguez 16  
Spike Lee 16  
Ridley Scott 16  
Martin Scorsese 20  
Clint Eastwood 20  
Woody Allen 21  
Steven Spielberg 27  
Name: director, dtype: int64

```code
    plt.subplots(figsize=(10,6))
    ax = list2.plot.barh(width=0.85,color='y')
    for i,v in enumerate(list2.values):
        ax.text(.5, i, v,fontsize=12,color='white',weight='bold')
    ax.patches[9].set_facecolor('g')
    plt.title('Directors with highest movies')
    plt.show()
[/code]

![png](https://img-blog.csdn.net/2018052313320860?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

```code
    top_vote_d = fulldf[fulldf['vote_average']>=8].sort_values(by='vote_average',ascending=False)
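    top_vote_d = top_vote_d.dropna()  # note: with no subset, this also drops movies missing homepage or tagline, not only those missing a director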
    top_vote_d = top_vote_d.loc[:,['director','vote_average']]
[/code]

```code
    tmp = rev_d.sort_values(ascending=False)
    vote_rev_d = tmp[tmp.index.isin(list(top_vote_d['director']))]
    vote_rev_d = vote_rev_d.sort_values(ascending=False)
    vote_rev_d = pd.DataFrame(vote_rev_d)
[/code]

```code
    plt.subplots(figsize=(10,6))
    sns.barplot(x='revenue',y=vote_rev_d.index,data=vote_rev_d,palette='BuGn_r')
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.xlabel('Average Revenue',fontsize=15)
    plt.ylabel('Director',fontsize=15)
    plt.title('Revenue by vote above 8 Director',fontsize=20)
    plt.show()
[/code]

![png](https://img-blog.csdn.net/2018052313323079?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

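Now for the cast. The cast feature lists many people for each movie; fortunately it is ordered by importance, so the names near the front can be treated as the lead actors.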

```code
    fulldf['cast']=fulldf['cast'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
    fulldf['cast']=fulldf['cast'].str.split(',')
    list1=[]
    for i in fulldf['cast']:
        list1.extend(i)
    list1 = pd.Series(list1)
    list1 = list1.value_counts()[:15].sort_values(ascending=True)
    plt.subplots(figsize=(10,6))
    ax = list1.plot.barh(width=0.9,color='green')
    for i,v in enumerate(list1.values):
        ax.text(.8, i, v,fontsize=10,color='white',weight='bold')
    plt.title('Actors with highest appearance')
    ax.patches[14].set_facecolor('b')
    plt.show()
[/code]

![png](https://img-blog.csdn.net/2018052313325049?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

```code
    fulldf['keywords'][2]
[/code]

“[‘spy’, ‘based on novel’, ‘secret agent’, ‘sequel’, ‘mi6’, ‘british secret
service’, ‘united kingdom’]”

```code
    from wordcloud import WordCloud, STOPWORDS
    import nltk
    from nltk.corpus import stopwords
    # If stopwords raises a "not installed" error, run import nltk; nltk.download() from the Anaconda prompt
    # and, in the window that opens, select Corpora, then stopwords, refresh, and download
    import io
    from PIL import Image
[/code]

```code
    plt.subplots(figsize=(12,12))
    stop_words = set(stopwords.words('english'))
    # update() expects iterables, so pass one list; as separate string arguments
    # each string is split into characters ('...' would only add '.')
    stop_words.update([',',';','!','?','.','(',')','$','#','+',':','...',' ',''])
    
    # the mask image gives the word cloud its shape
    img1 = Image.open('timg1.jpg')
    hcmask1 = np.array(img1)
    # tokenize every keyword string and flatten the tokens into one list
    words = fulldf['keywords'].dropna().apply(nltk.word_tokenize)
    word = []
    for i in words:
        word.extend(i)
    word = pd.Series(word)
    word = [i for i in word.str.lower() if i not in stop_words]
    wc = WordCloud(background_color="black", max_words=4000, mask=hcmask1,
                   stopwords=STOPWORDS, max_font_size=60)
    wc.generate(" ".join(word))
    
    plt.imshow(wc, interpolation="bilinear")
    plt.axis('off')
    plt.show()
[/code]

![png](https://img-blog.csdn.net/20180523133325401?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

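The word cloud gives a rough picture of the keywords: "woman director" and "independent film" take up a large share, which may also reflect a trend in filmmaking.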

##  Movie recommendation model

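Based on the analysis above, we can build a simple movie recommender. When people search for a movie, they usually look for movies of the same genre, movies by the same director or actors, or movies with high ratings, so the features we need are genres, cast, director, and score.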

```code
    l[:5]
[/code]

[‘Action’, ‘Adventure’, ‘Fantasy’, ‘ScienceFiction’, ‘Crime’]

###  Feature vectorization

####  genre

```code
    def binary(genre_list):
        binaryList = []
    
        for genre in l:
            if genre in genre_list:
                binaryList.append(1)
            else:
                binaryList.append(0)
    
        return binaryList
[/code]

```code
    fulldf['genre_vec'] = fulldf['genres'].apply(lambda x: binary(x))
[/code]

```code
    fulldf['genre_vec'][0]
[/code]

[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
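
The `binary` function above produces a multi-hot vector: one slot per known genre, set to 1 when the movie has that genre. The same pattern is repeated later for cast, director, and keywords, so a single helper that takes the vocabulary as a parameter could cover all of them. The sketch below is one possible way to write it; the name `multi_hot` and the five-genre `vocabulary` are illustrative assumptions, not part of the original notebook.

```python
def multi_hot(items, vocabulary):
    """Return a 0/1 list aligned with `vocabulary`: 1 where the label occurs in `items`."""
    present = set(items)  # set lookup avoids rescanning `items` for every label
    return [1 if label in present else 0 for label in vocabulary]


# Hypothetical vocabulary: the first five genres, as printed by l[:5] above
vocabulary = ['Action', 'Adventure', 'Fantasy', 'ScienceFiction', 'Crime']

# Genres of "Spectre" as they appear in this dataset
spectre = multi_hot(['Action', 'Adventure', 'Crime'], vocabulary)
print(spectre)  # [1, 1, 0, 0, 1]
```

With such a helper, each feature column would only need its own vocabulary list (for example `castList` or `words_list`) instead of a redefined `binary` function.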

####  cast

```code
    for i,j in zip(fulldf['cast'],fulldf.index):
        list2=[]
        list2=i[:4]
        list2.sort()
        fulldf.loc[j,'cast']=str(list2)
    fulldf['cast'][0]
[/code]

“[‘SamWorthington’, ‘SigourneyWeaver’, ‘StephenLang’, ‘ZoeSaldana’]”

```code
    fulldf['cast']=fulldf['cast'].str.strip('[]').str.replace(' ','').str.replace("'",'')
    fulldf['cast']=fulldf['cast'].str.split(',')
    fulldf['cast'][0]
[/code]

[‘SamWorthington’, ‘SigourneyWeaver’, ‘StephenLang’, ‘ZoeSaldana’]

```code
    castList = []
    for index, row in fulldf.iterrows():
        cast = row["cast"]
        for i in cast:
            if i not in castList:
                castList.append(i)
[/code]

```code
    len(castList)
[/code]

7515

```code
    def binary(cast_list):
        binaryList = []
    
        for genre in castList:
            if genre in cast_list:
                binaryList.append(1)
            else:
                binaryList.append(0)
    
        return binaryList
[/code]

```code
    fulldf['cast_vec'] = fulldf['cast'].apply(lambda x:binary(x))
    fulldf['cast_vec'].head(2)
[/code]

0 [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  
1 [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …  
Name: cast_vec, dtype: object

####  director

```code
    fulldf['director'][0]
[/code]

‘James Cameron’

```code
    def xstr(s):
        # a missing director can be None or NaN; map both to an empty string
        if pd.isna(s):
            return ''
        return str(s)
    fulldf['director']=fulldf['director'].apply(xstr)
[/code]

```code
    directorList=[]
    for i in fulldf['director']:
        if i not in directorList:
            directorList.append(i)
[/code]

```code
    def binary(director):
        binaryList = []
    
        for direct in directorList:
            # compare whole names: `in` on a string would be a substring test
            if direct == director:
                binaryList.append(1)
            else:
                binaryList.append(0)
    
        return binaryList
[/code]

```code
    fulldf['director_vec'] = fulldf['director'].apply(lambda x:binary(x))
[/code]

####  keywords

```code
    fulldf['keywords'][0]
[/code]

“[‘culture clash’, ‘future’, ‘space war’, ‘space colony’, ‘society’, ‘space
travel’, ‘futuristic’, ‘romance’, ‘space’, ‘alien’, ‘tribe’, ‘alien planet’,
‘cgi’, ‘marine’, ‘soldier’, ‘battle’, ‘love affair’, ‘anti war’, ‘power
relations’, ‘mind and soul’, ‘3d’]”

```code
    #change keywords to type list
    fulldf['keywords']=fulldf['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
    fulldf['keywords']=fulldf['keywords'].str.split(',')
[/code]

```code
    for i,j in zip(fulldf['keywords'],fulldf.index):
        list2=[]
        list2 = i
        list2.sort()
        fulldf.loc[j,'keywords']=str(list2)
    fulldf['keywords'][0]
[/code]

“[‘3d’, ‘alien’, ‘alienplanet’, ‘antiwar’, ‘battle’, ‘cgi’, ‘cultureclash’,
‘future’, ‘futuristic’, ‘loveaffair’, ‘marine’, ‘mindandsoul’,
‘powerrelations’, ‘romance’, ‘society’, ‘soldier’, ‘space’, ‘spacecolony’,
‘spacetravel’, ‘spacewar’, ‘tribe’]”

```code
    fulldf['keywords']=fulldf['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
    fulldf['keywords']=fulldf['keywords'].str.split(',')
[/code]

```code
    words_list = []
    for index, row in fulldf.iterrows():
        genres = row["keywords"]
    
        for genre in genres:
            if genre not in words_list:
                words_list.append(genre)
    len(words_list)
[/code]

9772

```code
    def binary(words):
        binaryList = []
    
        for genre in words_list:
            if genre in words:
                binaryList.append(1)
            else:
                binaryList.append(0)
    
        return binaryList
[/code]

```code
    fulldf['words_vec'] = fulldf['keywords'].apply(lambda x: binary(x))
[/code]

####  recommend model

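Cosine distance (1 minus the cosine of the angle between two vectors) serves as the similarity measure: it is computed between movies from the selected feature vectors, and the 10 movies with the smallest distance are returned as recommendations.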
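
As a small worked example of the measure, the sketch below computes the cosine distance by hand with NumPy. It is meant to mirror what `scipy.spatial.distance.cosine` returns (1 minus the cosine similarity); the function name `cosine_distance` is an assumption for illustration, and the two vectors are just the first five genre slots of "Avatar" and "Spectre" from this dataset.

```python
import numpy as np


def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means same direction, 1 means no overlap."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# First five genre slots: Action, Adventure, Fantasy, ScienceFiction, Crime
avatar = [1, 1, 1, 1, 0]
spectre = [1, 1, 0, 0, 1]

d = cosine_distance(avatar, spectre)
print(round(d, 4))  # 0.4226: two shared genres out of four and three
```

Because every feature contributes a distance between 0 and 1 for 0/1 vectors, the summed score used below ranges from 0 (identical on all features) to 4 (nothing in common). An all-zero vector has no direction, so the division is undefined for it; that case would need separate handling.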

```code
    fulldf=fulldf[(fulldf['vote_average']!=0)] # drop movies with a score of 0 or without a director name
    fulldf=fulldf[fulldf['director']!='']
[/code]

```code
    from scipy import spatial
    
    def Similarity(movieId1, movieId2):
        a = fulldf.iloc[movieId1]
        b = fulldf.iloc[movieId2]
    
        genresA = a['genre_vec']
        genresB = b['genre_vec']
        genreDistance = spatial.distance.cosine(genresA, genresB)
    
        castA = a['cast_vec']
        castB = b['cast_vec']
        castDistance = spatial.distance.cosine(castA, castB)
    
        directA = a['director_vec']
        directB = b['director_vec']
        directDistance = spatial.distance.cosine(directA, directB)
    
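        wordsA = a['words_vec']
        wordsB = b['words_vec']
        # compare the keyword vectors; the outputs shown in this post came from a run
        # that passed directA/directB here, so rerunning gives different numbers
        wordsDistance = spatial.distance.cosine(wordsA, wordsB)
        return genreDistance + directDistance + castDistance + wordsDistance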
[/code]

```code
    Similarity(3,160)
[/code]

2.7958758547680684

```code
    columns =['original_title','genres','vote_average','genre_vec','cast_vec','director','director_vec','words_vec']
    tmp = fulldf.copy()
    tmp =tmp[columns]
    tmp['id'] = list(range(0,fulldf.shape[0]))
    tmp.head()
[/code]

|   |  original_title  |  genres  |  vote_average  |  genre_vec  |  cast_vec  |  director  |  director_vec  |  words_vec  |  id  |
---|---|---|---|---|---|---|---|---|---  
0  |  Avatar  |  [Action, Adventure, Fantasy, ScienceFiction]  |  7.2  |  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  James Cameron  |  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …  |  0  
1  |  Pirates of the Caribbean: At World’s End  |  [Adventure, Fantasy, Action]  |  6.9  |  [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …  |  Gore Verbinski  |  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  1  
2  |  Spectre  |  [Action, Adventure, Crime]  |  6.3  |  [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, …  |  Sam Mendes  |  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  2  
3  |  The Dark Knight Rises  |  [Action, Crime, Drama, Thriller]  |  7.6  |  [1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, …  |  Christopher Nolan  |  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  3  
4  |  John Carter  |  [Action, Adventure, ScienceFiction]  |  6.1  |  [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  Andrew Stanton  |  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  4

```code
    tmp.isnull().sum()
[/code]  
  
original_title 0  
genres 0  
vote_average 0  
genre_vec 0  
cast_vec 0  
director 0  
director_vec 0  
words_vec 0  
id 0  
dtype: int64

```code
    import operator
    
    def recommend(name):
        # pick the first movie whose title contains `name`
        film = tmp[tmp['original_title'].str.contains(name)].iloc[0]
        print('Selected Movie: ', film['original_title'])
    
        # distance from the selected movie to every other movie
        distances = []
        for index, movie in tmp.iterrows():
            if movie['id'] != film['id']:
                dist = Similarity(film['id'], movie['id'])
                distances.append((movie['id'], dist))
    
        # smallest distance first; keep the 10 nearest neighbors
        distances.sort(key=operator.itemgetter(1))
        neighbors = distances[:10]
    
        print('\nRecommended Movies: \n')
        for nei in neighbors:
            row = tmp.iloc[nei[0]]  # `id` equals the row position in tmp
            print(row['original_title'] + " | Genres: " +
                  str(row['genres']).strip('[]').replace(' ', '') +
                  " | Rating: " + str(row['vote_average']))
    
        print('\n')
[/code]

```code
    recommend('Godfather')
[/code]

Selected Movie: The Godfather: Part III

Recommended Movies:

```code
    The Godfather: Part II | Genres: 'Drama','Crime' | Rating: 8.3
    The Godfather | Genres: 'Drama','Crime' | Rating: 8.4
    The Rainmaker | Genres: 'Drama','Crime','Thriller' | Rating: 6.7
    The Outsiders | Genres: 'Crime','Drama' | Rating: 6.9
    The Conversation | Genres: 'Crime','Drama','Mystery' | Rating: 7.5
    The Cotton Club | Genres: 'Music','Drama','Crime','Romance' | Rating: 6.6
    Apocalypse Now | Genres: 'Drama','War' | Rating: 8.0
    Twixt | Genres: 'Horror','Thriller' | Rating: 5.0
    New York Stories | Genres: 'Comedy','Drama','Romance' | Rating: 6.2
    Peggy Sue Got Married | Genres: 'Comedy','Drama','Fantasy','Romance' | Rating: 5.9
[/code]

#  Explanation of related functions

##  JSON handling

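JSON is a data-interchange format built from key-value pairs (objects), arrays, strings, numbers, booleans, and null.

  • json.loads() decodes a JSON string into Python objects (a JSON object becomes a dict);
  • its inverse, json.dumps(), encodes a Python object as a JSON string; to store the result in a file this way, call dumps() first and then write the string
  • json.dump() serializes a Python object such as a dict and writes it straight to a JSON file: json.dump(obj, file)
  • json.load() reads JSON data from a file: json.load(file)

```code
    exam = {'a':'1111','b':'2222','c':'3333','d':'4444'}
    file = 'exam.json'
    jsobj = json.dumps(exam)
    # solution 1: serialize with dumps(), then write the string
    with open(file,'w') as f:
        f.write(jsobj)  # the with block closes the file, so f.close() is not needed
    # solution 2: let dump() write to the file object directly
    with open(file,'w') as f:
        json.dump(exam,f)
[/code]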
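
The decoding direction is what this dataset needs most, because columns such as genres store JSON text. The sketch below shows `json.loads()` on a sample string shaped like one cell of that column, plus a `dumps()`/`loads()` round trip. The sample string and the variable names are assumptions made up for illustration.

```python
import json

# Sample string in the same shape as one cell of the genres column
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

genres = json.loads(raw)              # JSON array of objects -> list of dicts
names = [g['name'] for g in genres]   # keep only the genre names
print(names)                          # ['Action', 'Adventure']

# dumps() and loads() are inverses for plain JSON data
encoded = json.dumps(genres)
round_trip = json.loads(encoded)
print(round_trip == genres)           # True
```

Applied to a whole column, the same idea would be a `json.loads` call per row followed by the list comprehension.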

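##  zip()

  * zip() takes iterables as arguments and packs their corresponding elements into tuples. In Python 3 it returns a lazy iterator over those tuples (wrap it in list() to get a list), and it stops at the shortest input. 
  * The inverse operation is zip(*zipped), which unpacks the tuples again, for example: 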

```code
    a = [1,2,3]
    b = [4,5,6]
    c = [4,5,6,7,8]
    zipped = zip(a,b)
    for i in zipped:
        print(i)
    print('\n')
    shor_z = zip(a,c)
    for j in shor_z:  # zip stops at the shortest input
        print(j)
[/code]

(1, 4)  
(2, 5)  
(3, 6)  

(1, 4)  
(2, 5)  
(3, 6)

```code
    z=list(zip(a,b))
    z
[/code]

[(1, 4), (2, 5), (3, 6)]

```code
    list(zip(*z))  # wrap in list() to see the result
[/code]

[(1, 2, 3), (4, 5, 6)]

##  pandas merge/rename

pd.merge() joins two DataFrames on key columns.

```code
    a=pd.DataFrame({'lkey':['foo','foo','bar','bar'],'value':[1,2,3,4]})
    a
[/code]

|   |  lkey  |  value  |
---|---|---  
0  |  foo  |  1  
1  |  foo  |  2  
2  |  bar  |  3  
3  |  bar  |  4

```code
    for index,row in a.iterrows():
        print(index)
        print('*****')
        print(row)
[/code]  
  
0 ***** lkey foo value 1 Name: 0, dtype: object 1 ***** lkey foo value 2 Name:
1, dtype: object 2 ***** lkey bar value 3 Name: 2, dtype: object 3 ***** lkey
bar value 4 Name: 3, dtype: object

```code
    b=pd.DataFrame({'rkey':['foo','foo','bar','bar'],'value':[5,6,7,8]})
    b
[/code]

|   |  rkey  |  value  |
---|---|---  
0  |  foo  |  5  
1  |  foo  |  6  
2  |  bar  |  7  
3  |  bar  |  8

```code
    pd.merge(a,b,left_on='lkey',right_on='rkey',how='left')
[/code]  
  
|   |  lkey  |  value_x  |  rkey  |  value_y  |
---|---|---|---|---  
0  |  foo  |  1  |  foo  |  5  
1  |  foo  |  1  |  foo  |  6  
2  |  foo  |  2  |  foo  |  5  
3  |  foo  |  2  |  foo  |  6  
4  |  bar  |  3  |  bar  |  7  
5  |  bar  |  3  |  bar  |  8  
6  |  bar  |  4  |  bar  |  7  
7  |  bar  |  4  |  bar  |  8

```code
    pd.merge(a,b,left_on='lkey',right_on='rkey',how='inner')
[/code]  
  
|   |  lkey  |  value_x  |  rkey  |  value_y  |
---|---|---|---|---  
0  |  foo  |  1  |  foo  |  5  
1  |  foo  |  1  |  foo  |  6  
2  |  foo  |  2  |  foo  |  5  
3  |  foo  |  2  |  foo  |  6  
4  |  bar  |  3  |  bar  |  7  
5  |  bar  |  3  |  bar  |  8  
6  |  bar  |  4  |  bar  |  7  
7  |  bar  |  4  |  bar  |  8  
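
In the example above every key appears in both frames, so how='left' and how='inner' give the same result. The difference only shows when a key is missing on one side. The sketch below uses two small made-up frames (the names `left`, `right`, `lval`, and `rval` are illustrative) in which the key 'baz' exists only on the left.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [5, 6]})

# inner: keep only keys present in both frames ('baz' is dropped)
inner = pd.merge(left, right, on='key', how='inner')

# left: keep every row of the left frame; unmatched rows get NaN on the right side
left_join = pd.merge(left, right, on='key', how='left')

print(inner)
print(left_join)
```

Here `inner` has two rows, while `left_join` has three, with a missing `rval` for 'baz'.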
  
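DataFrame.rename() renames row or column labels; it returns a new DataFrame and leaves the original unchanged unless inplace=True is passed.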

```code
    dframe= pd.DataFrame(np.arange(12).reshape((3, 4)),
                     index=['NY', 'LA', 'SF'],
                     columns=['A', 'B', 'C', 'D'])
    dframe
[/code]

|   |  A  |  B  |  C  |  D  |
---|---|---|---|---  
NY  |  0  |  1  |  2  |  3  
LA  |  4  |  5  |  6  |  7  
SF  |  8  |  9  |  10  |  11

```code
    dframe.rename(columns={'A':'alpha'})
[/code]  
  
|   |  alpha  |  B  |  C  |  D  |
---|---|---|---|---  
NY  |  0  |  1  |  2  |  3  
LA  |  4  |  5  |  6  |  7  
SF  |  8  |  9  |  10  |  11

```code
    dframe
[/code]  
  
|   |  A  |  B  |  C  |  D  |
---|---|---|---|---  
NY  |  0  |  1  |  2  |  3  
LA  |  4  |  5  |  6  |  7  
SF  |  8  |  9  |  10  |  11

```code
    dframe.rename(columns={'A':'alpha'},inplace=True)
    dframe
[/code]  
  
|   |  alpha  |  B  |  C  |  D  |
---|---|---|---|---  
NY  |  0  |  1  |  2  |  3  
LA  |  4  |  5  |  6  |  7  
SF  |  8  |  9  |  10  |  11  
  
##  pandas datetime format

pandas to_datetime() converts values to datetime format.
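
A minimal sketch of the conversion, assuming date strings in the same YYYY-MM-DD form as the release_date column. The sample values and variable names are made up for illustration; errors='coerce' is one possible choice that turns unparseable entries into NaT instead of raising.

```python
import pandas as pd

# Sample strings; the last one cannot be parsed as a date
dates = pd.Series(['2009-12-10', '2012-07-16', 'not a date'])

# errors='coerce' replaces unparseable values with NaT
parsed = pd.to_datetime(dates, errors='coerce')

# once the column is datetime, parts such as the year are available via .dt
years = parsed.dt.year

print(parsed)
print(years)
```

Extracting the year this way is what makes grouping movies by release year straightforward.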

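##  Wordcloud

The wordcloud module:  
1. Install: in a conda prompt, run conda install -c conda-forge wordcloud  
2. Steps: read the background image and the text, instantiate a WordCloud object wc,  
call wc.generate(text) to build the cloud, and show it with plt.imshow(). Parameters:  
mask: mask image; it controls where words may be drawn, so the cloud takes the shape of the image  
background_color: background color, black by default  
max_font_size: largest font size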

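##  A brief introduction to nltk

from nltk.corpus import stopwords  
If this raises an error because the stopwords corpus is not installed, run import nltk; nltk.download() from the Anaconda prompt.  
In the window that opens, select Corpora, then stopwords, refresh, and download.  
Likewise, on the Models tab select Punkt Tokenizer Model, refresh, and download; nltk.word_tokenize() needs it for tokenization:  
nltk.sent_tokenize(text) # split a text into sentences

nltk.word_tokenize(sent) # split a sentence into words

stopwords: as I understand it, words that occur in large numbers, do not affect the meaning, and can simply be filtered out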
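
To make the idea of stop-word filtering concrete without downloading any corpus, the sketch below uses a tiny hand-written stop-word set and plain `str.split()` as stand-ins for `stopwords.words('english')` and `nltk.word_tokenize()`. Both stand-ins, the sample sentence, and the variable names are assumptions for illustration only.

```python
# Hypothetical mini stop-word list; nltk's English list is much longer
stop_words = {'the', 'of', 'a', 'in', 'is', 'and'}

text = 'The Lord of the Rings is a story of courage'

# Lowercase first so that 'The' and 'the' are treated alike, then split on whitespace
tokens = text.lower().split()

# Keep only the tokens that are not stop words
kept = [t for t in tokens if t not in stop_words]
print(kept)  # ['lord', 'rings', 'story', 'courage']
```

The word-cloud code earlier in this post applies the same filter, with the real nltk list, before joining the remaining words into one string.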

#  References:

[ what’s my score ](https://www.kaggle.com/ash316/what-s-my-score)  
[ TMDB means per genre ](https://www.kaggle.com/kkooijman/tmdb-means-per-genre)

* * *

_I am a beginner and still learning; feedback is welcome!_

