
Kaggle TMDB 5000: movie data analysis and a movie recommendation model

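The data comes from the TMDB 5000 movie dataset on Kaggle. This analysis covers visualization of the movie data and a simple movie recommendation model, including:
1. The distribution of movie genres and how it changes over time
2. The relationships among profit, rating, and popularity
3. Which directors' films sell well or are well received
4. The hardest-working cast and crew
5. Movie keyword analysis
6. Similarity-based movie recommendation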

##  Data analysis

```code
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline
    plt.style.use('ggplot')
    import json
    import warnings
    warnings.filterwarnings('ignore')  # suppress warnings
[/code]

```code
    movie = pd.read_csv('tmdb_5000_movies.csv')
    credit = pd.read_csv('tmdb_5000_credits.csv')
[/code]

```code
    movie.head(1)
[/code]

|  | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam… | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289… | [{"iso_3166_1": "US", "name": "United States o… | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso… | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |

```code
    movie.tail(3)
[/code]  
  
|  | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4800 | 0 | [{"id": 35, "name": "Comedy"}, {"id": 18, "nam… | http://www.hallmarkchannel.com/signedsealeddel… | 231617 | [{"id": 248, "name": "date"}, {"id": 699, "nam… | en | Signed, Sealed, Delivered | "Signed, Sealed, Delivered" introduces a dedic… | 1.444476 | [{"name": "Front Street Pictures", "id": 3958}… | [{"iso_3166_1": "US", "name": "United States o… | 2013-10-13 | 0 | 120.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | Signed, Sealed, Delivered | 7.0 | 6 |
| 4801 | 0 | [] | http://shanghaicalling.com/ | 126186 | [] | en | Shanghai Calling | When ambitious New York attorney Sam is sent t… | 0.857008 | [] | [{"iso_3166_1": "US", "name": "United States o… | 2012-05-03 | 0 | 98.0 | [{"iso_639_1": "en", "name": "English"}] | Released | A New Yorker in Shanghai | Shanghai Calling | 5.7 | 7 |
| 4802 | 0 | [{"id": 99, "name": "Documentary"}] | NaN | 25975 | [{"id": 1523, "name": "obsession"}, {"id": 224… | en | My Date with Drew | Ever since the second grade when he first saw … | 1.929883 | [{"name": "rusty bear entertainment", "id": 87… | [{"iso_3166_1": "US", "name": "United States o… | 2005-08-05 | 0 | 90.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | My Date with Drew | 6.3 | 16 |

```code
    movie.info()  # 4803 samples; some features have missing values
[/code]

```code
    
    RangeIndex: 4803 entries, 0 to 4802
    Data columns (total 20 columns):
    budget                  4803 non-null int64
    genres                  4803 non-null object
    homepage                1712 non-null object
    id                      4803 non-null int64
    keywords                4803 non-null object
    original_language       4803 non-null object
    original_title          4803 non-null object
    overview                4800 non-null object
    popularity              4803 non-null float64
    production_companies    4803 non-null object
    production_countries    4803 non-null object
    release_date            4802 non-null object
    revenue                 4803 non-null int64
    runtime                 4801 non-null float64
    spoken_languages        4803 non-null object
    status                  4803 non-null object
    tagline                 3959 non-null object
    title                   4803 non-null object
    vote_average            4803 non-null float64
    vote_count              4803 non-null int64
    dtypes: float64(3), int64(4), object(13)
    memory usage: 750.5+ KB

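[/code]

There are 4803 samples, and some features have missing values. homepage and tagline have the most missing entries, but these two do not affect the basic analysis; release_date and runtime can be filled in. On closer inspection, some samples have [] as the value of genres, keywords, or production_companies, which needs attention.

```code
    credit.info()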
[/code]

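##  Data cleaning

Many of the features are stored as JSON, that is, as dictionary-like key-value pairs. To simplify later processing, we convert them into str or list forms that are easy to handle in Python, which makes the useful information easier to extract.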
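The conversion described above can be sketched on a single cell before applying it to whole columns. This is a minimal illustration, assuming each cell is a JSON array of objects with a "name" key, as in the genres and keywords columns shown earlier; the helper name `extract_names` is hypothetical and not part of the original notebook.

```python
import json


def extract_names(raw):
    """Parse a JSON-encoded list of objects and return their "name" values in order."""
    return [item["name"] for item in json.loads(raw)]


# A shortened cell in the same shape as the genres column (sample data).
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
names = extract_names(raw_genres)
empty = extract_names("[]")  # films with no genres yield an empty list

print(names)
print(empty)
```

With pandas, the same helper could be passed to `Series.apply` so that each cell becomes a real list rather than the string form of a list; the post instead stores `str(list)` and cleans the string later.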

```code
    # movie genres, used to categorize the films
    movie['genres']=movie['genres'].apply(json.loads)
    # Series.apply calls the function on every element; here it parses each JSON string into a list of dicts
[/code]

```code
    movie['genres'].head(1)
[/code]

```code
    0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
    Name: genres, dtype: object
[/code]

```code
    list(zip(movie.index,movie['genres']))[:2]
[/code]

```code
    [(0,
      [{'id': 28, 'name': 'Action'},
       {'id': 12, 'name': 'Adventure'},
       {'id': 14, 'name': 'Fantasy'},
       {'id': 878, 'name': 'Science Fiction'}]),
     (1,
      [{'id': 12, 'name': 'Adventure'},
       {'id': 14, 'name': 'Fantasy'},
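       {'id': 28, 'name': 'Action'}])]
[/code]

```code
    # Replace each movie's list of dicts with the str() of its genre names
    for index,i in zip(movie.index,movie['genres']):
        list1=[]
        for j in range(len(i)):
            list1.append((i[j]['name']))  # value of the 'name' key, e.g. 'Action'
        movie.loc[index,'genres']=str(list1)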
[/code]

```code
    movie.head(1)
    # genres is no longer JSON: the value of each 'name' key (the genre) has been extracted and assigned back to genres
[/code]

|  | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | ['Action', 'Adventure', 'Fantasy', 'Science Fi… | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289… | [{"iso_3166_1": "US", "name": "United States o… | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso… | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |

```code
    # apply the same method to the keywords column
    movie['keywords'] = movie['keywords'].apply(json.loads)
    for index,i in zip(movie.index,movie['keywords']):
        list2=[]
        for j in range(len(i)):
            list2.append(i[j]['name'])
        movie.loc[index,'keywords'] = str(list2)
[/code]

```code
    # likewise for production_companies
    movie['production_companies'] = movie['production_companies'].apply(json.loads)
    for index,i in zip(movie.index,movie['production_companies']):
        list3=[]
        for j in range(len(i)):
            list3.append(i[j]['name'])
        movie.loc[index,'production_companies']=str(list3)
[/code]

```code
    movie['production_countries'] = movie['production_countries'].apply(json.loads)
    for index,i in zip(movie.index,movie['production_countries']):
        list3=[]
        for j in range(len(i)):
            list3.append(i[j]['name'])
        movie.loc[index,'production_countries']=str(list3)
[/code]

```code
    movie['spoken_languages'] = movie['spoken_languages'].apply(json.loads)
    for index,i in zip(movie.index,movie['spoken_languages']):
        list3=[]
        for j in range(len(i)):
            list3.append(i[j]['name'])
        movie.loc[index,'spoken_languages']=str(list3)
[/code]

```code
    movie.head(1)
[/code]  
  
|  | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | ['Action', 'Adventure', 'Fantasy', 'Science Fi… | http://www.avatarmovie.com/ | 19995 | ['culture clash', 'future', 'space war', 'spac… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | ['Ingenious Film Partners', 'Twentieth Century… | ['United States of America', 'United Kingdom'] | 2009-12-10 | 2787965087 | 162.0 | ['English', 'Español'] | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |

```code
    credit.head(1)
[/code]  
  
|  | movie_id | title | cast | crew |
|---|---|---|---|---|
| 0 | 19995 | Avatar | [{"cast_id": 242, "character": "Jake Sully", "… | [{"credit_id": "52fe48009251416c750aca23", "de… |

```code
    credit['cast'] = credit['cast'].apply(json.loads)
    for index,i in zip(credit.index,credit['cast']):
        list3=[]
        for j in range(len(i)):
            list3.append(i[j]['name'])
        credit.loc[index,'cast']=str(list3)
[/code]

```code
    credit['crew'] = credit['crew'].apply(json.loads)
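    # Extract the director from crew into its own column for later analysis
    def director(x):
        for i in x:
            if i['job'] == 'Director':
                return i['name']  # first crew member whose job is Director
        return None  # no director listed; shows up as a missing value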
    credit['crew']=credit['crew'].apply(director)
    credit.rename(columns={'crew':'director'},inplace=True)
[/code]

```code
    credit.head(1)
[/code]  
  
|  | movie_id | title | cast | director |
|---|---|---|---|---|
| 0 | 19995 | Avatar | ['Sam Worthington', 'Zoe Saldana', 'Sigourney … | James Cameron |
  
The id column in movie matches movie_id in credit, so the two tables can be merged to bring all the information into a single table.
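Before relying on such a merge, it can help to check that the keys really line up. The sketch below uses small made-up frames (the titles and ids are placeholders, not rows from the dataset) and two standard `pd.merge` options: `validate` raises if a key is duplicated on either side, and `indicator` adds a `_merge` column showing which rows found a partner.

```python
import pandas as pd

# Toy stand-ins for the movie and credit tables (hypothetical values).
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["Movie A", "Movie B", "Movie C"]})
credits = pd.DataFrame({"movie_id": [1, 2], "director": ["Director X", "Director Y"]})

merged = pd.merge(
    movies,
    credits,
    left_on="id",
    right_on="movie_id",
    how="left",
    validate="one_to_one",  # fail loudly if either key column has duplicates
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Titles that had no matching credits row.
unmatched = merged.loc[merged["_merge"] == "left_only", "title"].tolist()
print(merged.shape)
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every film has a credits row.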

```code
    fulldf = pd.merge(movie,credit,left_on='id',right_on='movie_id',how='left')
[/code]

```code
    fulldf.head(1)
[/code]

|  | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | … | spoken_languages | status | tagline | title_x | vote_average | vote_count | movie_id | title_y | cast | director |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | ['Action', 'Adventure', 'Fantasy', 'Science Fi… | http://www.avatarmovie.com/ | 19995 | ['culture clash', 'future', 'space war', 'spac… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | ['Ingenious Film Partners', 'Twentieth Century… | … | ['English', 'Español'] | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 | 19995 | Avatar | ['Sam Worthington', 'Zoe Saldana', 'Sigourney … | James Cameron |
  
1 rows × 24 columns

```code
    fulldf.shape
[/code]

(4803, 24)

```code
    # both tables have a title column, so the merge renamed them title_x and title_y
    fulldf.rename(columns={'title_x':'title'},inplace=True)
    fulldf.drop('title_y',axis=1,inplace=True)
[/code]

```code
    # missing values
    NAs = pd.DataFrame(fulldf.isnull().sum())
    NAs[NAs.sum(axis=1)>0].sort_values(by=[0],ascending=False)
[/code]

|  | 0 |
|---|---|
| homepage | 3091 |
| tagline | 844 |
| director | 30 |
| overview | 3 |
| runtime | 2 |
| release_date | 1 |

```code
    # fill in the missing release_date
    fulldf.loc[fulldf['release_date'].isnull(),'title']
[/code]  
  
4553 America Is Still the Place  
Name: title, dtype: object

```code
    # value looked up online
    fulldf['release_date']=fulldf['release_date'].fillna('2014-06-01')
[/code]

```code
    # runtime is the film length; fill missing values with the mean
    fulldf['runtime'] = fulldf['runtime'].fillna(fulldf['runtime'].mean())
[/code]

```code
    # For convenience, convert release_date (object) to datetime once and extract the year and month
    release_dt = pd.to_datetime(fulldf['release_date'],format='%Y-%m-%d')
    fulldf['release_year'] = release_dt.dt.year
    fulldf['release_month'] = release_dt.dt.month
[/code]

##  Data exploration

```code
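    # genres
    # The column holds the str() of a list, so it needs string processing: first strip the surrounding brackets,
    # then remove the spaces (this also turns 'Science Fiction' into 'ScienceFiction'),
    # then remove the single quotes so the values can be split on ','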
    fulldf['genres']=fulldf['genres'].str.strip('[]').str.replace(" ","").str.replace("'","")
[/code]

```code
    # the genres are now separated by ','
    fulldf['genres']=fulldf['genres'].str.split(',')
[/code]

```code
    list1=[]
    for i in fulldf['genres']:
        list1.extend(i)
    gen_list=pd.Series(list1).value_counts()[:10].sort_values(ascending=False)
    # name the column explicitly; the default column name differs across pandas versions
    gen_df = gen_list.to_frame('Total')
[/code]

```code
    fulldf.loc[4801]  # .ix was removed from pandas; .loc selects by label
[/code]

```code
      budget                                                                  0
    genres                                                                 []
    homepage                                      http://shanghaicalling.com/
    id                                                                 126186
    keywords                                                               []
    original_language                                                      en
    original_title                                           Shanghai Calling
    overview                When ambitious New York attorney Sam is sent t...
    popularity                                                       0.857008
    production_companies                                                   []
    production_countries                ['United States of America', 'China']
    release_date                                                   2012-05-03
    revenue                                                                 0
    runtime                                                                98
    spoken_languages                                              ['English']
    status                                                           Released
    tagline                                          A New Yorker in Shanghai
    title                                                    Shanghai Calling
    vote_average                                                          5.7
    vote_count                                                              7
    movie_id                                                           126186
    cast                    ['Daniel Henney', 'Eliza Coupe', 'Bill Paxton'...
    director                                                      Daniel Hsia
    release_year                                                         2012
    release_month                                                           5
    Name: 4801, dtype: object
[/code]

```code
    plt.subplots(figsize=(10,8))
    sns.barplot(y=gen_df.index,x='Total',data=gen_df,palette='GnBu_d')
    plt.xticks(fontsize=15)  # set the tick label font size
    plt.yticks(fontsize=15)
    plt.xlabel('Total',fontsize=15)
    plt.ylabel('Genres',fontsize=15)
    plt.title('Top 10 Genres',fontsize=20)
    plt.show()
[/code]

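![png](https://img-blog.csdn.net/20180523132516220?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The ten most common genres include drama, comedy, thriller, and action, which are also the genres most often seen in cinemas today. What lies behind the large number of films in these genres?  
Let us look at how the number of films relates to time.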

```code
    # de-duplicate the genres
    l=[]
    for i in list1:
        if i not in l:
            l.append(i)
    #l.remove("")  # some movies have an empty genre
    len(l)  # l is the de-duplicated list of genres
[/code]

21

```code
    year_min = fulldf['release_year'].min()
    year_max = fulldf['release_year'].max()
    
    # DataFrame with genres as the index and years as the columns, holding the count of each genre per year (initially 0)
    year_genr = pd.DataFrame(0,index=l,columns=range(year_min,year_max+1))
    
    
    intil_y = np.array(fulldf['release_year'])  # release year of each movie, in row order
    z = 0
    for i in fulldf['genres']:
        splt_gen = list(i)  # all genres of one movie
        for j in splt_gen:
            year_genr.loc[j,intil_y[z]] = year_genr.loc[j,intil_y[z]]+1  # count this genre in that year
        z+=1

    year_genr = year_genr.sort_values(by=2006,ascending=False)  # rank genres by their 2006 count
    year_genr = year_genr.iloc[0:10,-49:-1]  # top 10 genres, 48 year columns, excluding the last year
    year_genr
[/code]

|  | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | … | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Drama | 7 | 8 | 3 | 6 | 4 | 3 | 2 | 4 | 8 | 5 | … | 97 | 106 | 122 | 115 | 99 | 79 | 110 | 110 | 95 | 37 |
| Comedy | 3 | 4 | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 1 | … | 67 | 82 | 97 | 87 | 82 | 80 | 71 | 62 | 52 | 26 |
| Thriller | 3 | 2 | 3 | 1 | 2 | 1 | 1 | 2 | 4 | 5 | … | 53 | 55 | 59 | 56 | 69 | 58 | 53 | 66 | 67 | 27 |
| Action | 4 | 4 | 4 | 1 | 2 | 1 | 1 | 2 | 6 | 5 | … | 44 | 46 | 51 | 49 | 58 | 43 | 56 | 54 | 46 | 39 |
| Romance | 2 | 1 | 3 | 0 | 1 | 2 | 1 | 2 | 2 | 3 | … | 37 | 38 | 57 | 45 | 30 | 39 | 25 | 24 | 23 | 9 |
| Family | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | … | 20 | 29 | 28 | 29 | 28 | 17 | 22 | 23 | 17 | 9 |
| Crime | 3 | 0 | 2 | 3 | 2 | 2 | 0 | 2 | 0 | 0 | … | 28 | 33 | 32 | 30 | 24 | 27 | 37 | 27 | 26 | 10 |
| Adventure | 2 | 3 | 1 | 2 | 1 | 2 | 2 | 2 | 5 | 4 | … | 25 | 37 | 36 | 30 | 32 | 25 | 36 | 37 | 35 | 23 |
| Fantasy | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 2 | … | 19 | 20 | 22 | 21 | 15 | 19 | 21 | 16 | 10 | 13 |
| Horror | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 4 | … | 27 | 21 | 30 | 27 | 24 | 33 | 25 | 21 | 33 | 20 |
  
10 rows × 48 columns

```code
    plt.subplots(figsize=(10,8))
    plt.plot(year_genr.T)
    plt.title('Genres vs Time',fontsize=20)
    plt.xticks(range(1969,2020,5))
    plt.legend(year_genr.T)
    plt.show()
[/code]

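![png](https://img-blog.csdn.net/20180523132553536?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

As the chart shows, film entered a boom period from around 1994: every genre grew substantially, with drama, comedy, thriller, and action growing the most. The large number of films in these genres is therefore related, to some extent, to the overall growth of filmmaking.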
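The genre-by-year counts behind a chart like this can also be tallied without a pre-filled DataFrame. The following is a minimal sketch on made-up films (the years and genres are illustrative, not dataset rows), using `collections.Counter` keyed by (genre, year) pairs.

```python
from collections import Counter

# Toy data (hypothetical): each film has a release year and a list of genres.
films = [
    {"year": 2009, "genres": ["Action", "Adventure"]},
    {"year": 2009, "genres": ["Drama"]},
    {"year": 2010, "genres": ["Action", "Drama"]},
]

# One count per (genre, year) pair; a film with two genres contributes to both.
counts = Counter((genre, film["year"]) for film in films for genre in film["genres"])

years = sorted({film["year"] for film in films})
genres = sorted({genre for genre, _ in counts})

# Rows are genres, columns follow `years`; missing pairs count as 0.
table = {genre: [counts[(genre, year)] for year in years] for genre in genres}
for genre, row in table.items():
    print(genre, row)
```

Because a film is counted once per genre, the column totals exceed the number of films, which is also true of the table in the post.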

```code
    # For convenience, build a new DataFrame with a subset of the features to analyze how they relate to genre.
    partdf = fulldf[['title','vote_average','vote_count','release_year','popularity','budget','revenue']].reset_index(drop=True)
[/code]

```code
    partdf.head(2)
[/code]

|  | title | vote_average | vote_count | release_year | popularity | budget | revenue |
|---|---|---|---|---|---|---|---|
| 0 | Avatar | 7.2 | 11800 | 2009 | 150.437577 | 237000000 | 2787965087 |
| 1 | Pirates of the Caribbean: At World's End | 6.9 | 4500 | 2007 | 139.082615 | 300000000 | 961000000 |
  
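Because a movie can belong to several genres, we add each genre as a column; for every movie, the column is set to 1 if the movie has that genre and 0 otherwise.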
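The same 0/1 encoding can be produced in one step with pandas string methods. This is a sketch on a made-up frame (titles and genres are placeholders), assuming the genres column holds Python lists: `Series.str.join` concatenates each list with a separator and `Series.str.get_dummies` splits it back into indicator columns.

```python
import pandas as pd

# Toy frame (hypothetical values); the last film has no genres at all.
films = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "genres": [["Action", "Fantasy"], ["Drama"], []],
})

# Join each list with '|' and expand it into one 0/1 column per genre.
flags = films["genres"].str.join("|").str.get_dummies(sep="|")
encoded = pd.concat([films[["title"]], flags], axis=1)
print(encoded)
```

A film with an empty genre list simply gets 0 in every column, so no placeholder column for the empty string is created.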

```code
    for per in l:
        partdf[per]=0
    
        z=0
        for gen in fulldf['genres']:
    
            if per in list(gen):
                partdf.loc[z,per] = 1
            else:
                partdf.loc[z,per] = 0
            z+=1
    partdf.head(2)
[/code]

|  | title | vote_average | vote_count | release_year | popularity | budget | revenue | Action | Adventure | Fantasy | … | Romance | Horror | Mystery | History | War | Music | Documentary | Foreign | TVMovie |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Avatar | 7.2 | 11800 | 2009 | 150.437577 | 237000000 | 2787965087 | 1 | 1 | 1 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Pirates of the Caribbean: At World's End | 6.9 | 4500 | 2007 | 139.082615 | 300000000 | 961000000 | 1 | 1 | 1 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
  
2 rows × 28 columns

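Next we want the per-genre averages of several features. We create a new DataFrame with one row per genre and one column per averaged feature, such as rating (vote_average), revenue, and popularity.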
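As an alternative to looping over indicator columns, per-genre averages can be computed by giving each (film, genre) pair its own row. The sketch below uses made-up numbers and assumes the genres column holds lists; `DataFrame.explode` and `groupby(...).mean()` are standard pandas calls.

```python
import pandas as pd

# Toy data (hypothetical values).
films = pd.DataFrame({
    "genres": [["Action", "Fantasy"], ["Action"], ["Drama"]],
    "vote_average": [7.0, 6.0, 8.0],
    "revenue": [300, 100, 50],
})

# explode() repeats a film once per genre, so groupby can average within each genre.
per_genre = (
    films.explode("genres")
    .groupby("genres")[["vote_average", "revenue"]]
    .mean()
)
print(per_genre)
```

A film with several genres contributes to each of them, which matches the 0/1 column approach used in the post.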

```code
    mean_gen = pd.DataFrame(l)
[/code]

```code
    # mean rating
    newArray = []
    for genre in l:
        newArray.append(partdf.groupby(genre, as_index=True)['vote_average'].mean())
    # each element of newArray holds two means, indexed by flag 0 (not this genre) and 1 (this genre); we only need the value at 1
    newArray2 = []
    for i in range(len(l)):
        newArray2.append(newArray[i][1])
    
    mean_gen['mean_votes_average']=newArray2
    mean_gen.head(2)
[/code]

.dataframe thead tr:only-child th { text-align: right; } .dataframe thead th {
text-align: left; } .dataframe tbody tr th { vertical-align: top; }  |  0  |
mean_votes_average  
---|---|---  
0  |  Action  |  5.989515  
1  |  Adventure  |  6.156962

```code
    # apply the same approach to the other features
    # budget
    newArray = []
    for genre in l:
        newArray.append(partdf.groupby(genre, as_index=True)['budget'].mean())
    newArray2 = []
    for i in range(len(l)):
        newArray2.append(newArray[i][1])
    
    mean_gen['mean_budget']=newArray2
[/code]

```code
    # revenue
    newArray = []
    for genre in l:
        newArray.append(partdf.groupby(genre, as_index=True)['revenue'].mean())
    newArray2 = []
    for i in range(len(l)):
        newArray2.append(newArray[i][1])
    
    mean_gen['mean_revenue']=newArray2
[/code]

```code
    # popularity: number of views of the related pages
    newArray = []
    for genre in l:
        newArray.append(partdf.groupby(genre, as_index=True)['popularity'].mean())
    newArray2 = []
    for i in range(len(l)):
        newArray2.append(newArray[i][1])
    
    mean_gen['mean_popular']=newArray2
[/code]

```code
    # vote_count: .count() counts the rows in each group, so this column is the number of movies per genre, not the number of votes
    newArray = []
    for genre in l:
        newArray.append(partdf.groupby(genre, as_index=True)['vote_count'].count())
    newArray2 = []
    for i in range(len(l)):
        newArray2.append(newArray[i][1])
    
    mean_gen['vote_count']=newArray2
[/code]

```code
    mean_gen.rename(columns={0:'genre'},inplace=True)
    mean_gen.replace('','none',inplace=True)
    # 'none' marks movies whose genre is missing; there are very few of them, so we drop that row
    mean_gen.drop(20,inplace=True)
[/code]

```code
    mean_gen['vote_count'].describe()
[/code]  
  
count 20.000000  
mean 608.000000  
std 606.931974  
min 8.000000  
25% 174.750000  
50% 468.500000  
75% 816.000000  
max 2297.000000  
Name: vote_count, dtype: float64

```code
    mean_gen['mean_votes_average'].describe()
[/code]

count 20.000000  
mean 6.173921  
std 0.278476  
min 5.626590  
25% 6.009644  
50% 6.180978  
75% 6.344325  
max 6.719797  
Name: mean_votes_average, dtype: float64

```code
    f, ax1 = plt.subplots(figsize=(10,6))
    ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
    # pointplot is the axes-level plot that factorplot drew by default, and it accepts ax=
    sns.pointplot(x='genre', y='mean_votes_average',data=mean_gen,ax=ax1)
    ax1.set_ylabel('votes_average')
    ax1.set_ylim((4,7))
    
    sns.pointplot(x='genre',y='mean_popular',data=mean_gen,ax=ax2,color='blue')
    ax2.set_ylabel('popularity')
    ax2.set_ylim((0,40))
    ax1.set_xticklabels(mean_gen['genre'],rotation=90)
    
    plt.show()
[/code]

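![png](https://img-blog.csdn.net/20180523132655277?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

As the chart shows, foreign films are not popular; their ratings are not low, but that is because so few people rated them. Animation, Science Fiction, Fantasy, and Action films are relatively popular and also rate reasonably well. Drama, the most numerous genre, rates highly but is less popular, possibly because most dramas are not commercial films.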

```code
    mean_gen['profit'] = mean_gen['mean_revenue']-mean_gen['mean_budget']
[/code]

```code
    s = mean_gen['profit'].sort_values(ascending=False)[:10]
    pdf = mean_gen.loc[s.index]  # .ix was removed from pandas; .loc selects by label
    
    plt.subplots(figsize=(10,6))
    sns.barplot(x='profit',y='genre',data=pdf,palette='BuGn_r')
    plt.xticks(fontsize=15)  # set the tick label font size
    plt.yticks(fontsize=15)
    plt.xlabel('Profit',fontsize=15)
    plt.ylabel('Genres',fontsize=15)
    plt.title('Top 10 Profit of Genres',fontsize=20)
    
    plt.show()
[/code]

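![png](https://img-blog.csdn.net/20180523132747502?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Animation, adventure, family, and science fiction are the most profitable genres. They suit the big screen and are also among the popular genres. Next, let us look at how the variables relate to one another.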

```code
    cordf = partdf.drop(l,axis=1)
    cordf.columns  # the features we want to examine
[/code]

```code
     Index(['title', 'vote_average', 'vote_count', 'release_year', 'popularity',
           'budget', 'revenue'],
          dtype='object')
[/code]

```code
    corrmat = cordf.corr(numeric_only=True)  # skip the non-numeric title column (needs pandas 1.5+)
    f, ax = plt.subplots(figsize=(10,7))
    sns.heatmap(corrmat,cbar=True, annot=True,vmax=.8, cmap='PuBu',square=True)
[/code]

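![png](https://img-blog.csdn.net/20180523132843861?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The heatmap shows a fairly strong relationship between vote count and popularity, which suggests that films seen by more people also draw more participation. Budget and revenue are also fairly strongly related, as are revenue and both popularity and vote count, so promoting a film well matters. Let us look a little further.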
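Each cell of such a heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. As a small worked example on made-up numbers, the coefficient is the covariance of the two variables divided by the product of their standard deviations; the function name `pearson` is only illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy values (hypothetical): revenue is exactly twice the budget.
budget = [1, 2, 3, 4]
revenue = [2, 4, 6, 8]
r_up = pearson(budget, revenue)          # perfectly increasing relationship
r_down = pearson([1, 2, 3], [3, 2, 1])   # perfectly decreasing relationship
print(round(r_up, 6), round(r_down, 6))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship; the coefficient says nothing about which variable drives the other.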

```code
    # budget and revenue contain zeros; drop these dirty rows
    partdf = partdf[partdf['budget']>0]
    partdf = partdf[partdf['revenue']>0]
    partdf = partdf[partdf['vote_count']>3]
    plt.subplots(figsize=(6,5))
    
    plt.xlabel('Budget',fontsize=15)
    plt.ylabel('Revenue',fontsize=15)
    plt.title('Budget vs Revenue',fontsize=20)
    sns.regplot(x='budget',y='revenue',data=partdf,ci=None)
[/code]

![png](https://img-blog.csdn.net/20180523132916443?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

```code
    plt.subplots(figsize=(6,5))
    plt.xlabel('vote_average',fontsize=15)
    plt.ylabel('popularity',fontsize=15)
    plt.title('Score vs Popular',fontsize=20)
    sns.regplot(x='vote_average',y='popularity',data=partdf)
[/code]

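![png](https://img-blog.csdn.net/20180523132946451?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Budget and revenue, and rating and popularity, do show roughly linear relationships. For low-budget films, however, budget has little effect on revenue, and highly rated films are for the most part also popular. Now let us see which films earn the most, are the most popular, and have the best reputation.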

```code
    print(partdf.loc[partdf['revenue']==partdf['revenue'].max()]['title'])
    print(partdf.loc[partdf['popularity']==partdf['popularity'].max()]['title'])
    print(partdf.loc[partdf['vote_average']==partdf['vote_average'].max()]['title'])
[/code]

0 Avatar  
Name: title, dtype: object  
546 Minions  
Name: title, dtype: object  
1881 The Shawshank Redemption  
Name: title, dtype: object

```code
    partdf['profit'] = partdf['revenue']-partdf['budget']
    print(partdf.loc[partdf['profit']==partdf['profit'].max()]['title'])
[/code]

0 Avatar  
Name: title, dtype: object

Minions is the most popular, Avatar earns the most, and The Shawshank Redemption has the best reputation.

```code
    s1 = cordf.groupby(by='release_year').budget.sum()
    s2 = cordf.groupby(by='release_year').revenue.sum()
    sdf = pd.concat([s1,s2],axis=1)
    sdf = sdf.iloc[-39:-2]
    plt.plot(sdf)
    plt.xticks(range(1979,2020,5))
    plt.legend(sdf)
    plt.show()
[/code]

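![png](https://img-blog.csdn.net/20180523133047409?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The film industry is clearly booming! Big-budget productions are becoming more and more common, and it seems there is a reason for that.

For science-fiction fans, here are the most popular science-fiction films: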

```code
    # most popular science-fiction films
    s = partdf.loc[partdf['ScienceFiction']==1,'popularity'].sort_values(ascending=False)[:10]
    sdf = partdf.loc[s.index]  # .ix was removed from pandas; .loc selects by label
    sns.barplot(x='popularity',y='title',data=sdf)
    plt.show()
[/code]

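![png](https://img-blog.csdn.net/20180523133107982?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Interstellar is the most popular, followed closely by Guardians of the Galaxy; other genres can be examined in the same way. Now let us look at how filmmakers affect the film market. A good film depends on the people in front of and behind the camera, and it is these talented filmmakers who bring us films worth watching. Here we focus on directors and actors.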

```code
    # directors with the highest average revenue
    rev_d = fulldf.groupby('director')['revenue'].mean()
    top_rev_d = rev_d.sort_values(ascending=False).head(20)
    top_rev_d = pd.DataFrame(top_rev_d)

    plt.subplots(figsize=(10,6))
    sns.barplot(x='revenue',y=top_rev_d.index,data=top_rev_d,palette='BuGn_r')
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.xlabel('Average Revenue',fontsize=15)
    plt.ylabel('Director',fontsize=15)
    plt.title('Top 20 Revenue by Director',fontsize=20)
    plt.show()
[/code]

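![png](https://img-blog.csdn.net/20180523133135909?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The chart shows the directors who do well at the box office. Which directors are the most prolific, and which are both critically acclaimed and commercially successful?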

```code
    list2 = fulldf[fulldf['director']!=''].director.value_counts()[:10].sort_values(ascending=True)
    list2 = pd.Series(list2)
    list2
[/code]

Oliver Stone 14  
Renny Harlin 15  
Steven Soderbergh 15  
Robert Rodriguez 16  
Spike Lee 16  
Ridley Scott 16  
Martin Scorsese 20  
Clint Eastwood 20  
Woody Allen 21  
Steven Spielberg 27  
Name: director, dtype: int64

```code
    plt.subplots(figsize=(10,6))
    ax = list2.plot.barh(width=0.85,color='y')
    for i,v in enumerate(list2.values):
        ax.text(.5, i, v,fontsize=12,color='white',weight='bold')
    ax.patches[9].set_facecolor('g')
    plt.title('Directors with highest movies')
    plt.show()
[/code]

![png](https://img-blog.csdn.net/2018052313320860?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

```code
    top_vote_d = fulldf[fulldf['vote_average']>=8].sort_values(by='vote_average',ascending=False)
    top_vote_d = top_vote_d.dropna()
    top_vote_d = top_vote_d.loc[:,['director','vote_average']]
[/code]

```code
    tmp = rev_d.sort_values(ascending=False)
    vote_rev_d = tmp[tmp.index.isin(list(top_vote_d['director']))]
    vote_rev_d = vote_rev_d.sort_values(ascending=False)
    vote_rev_d = pd.DataFrame(vote_rev_d)
[/code]

```code
    plt.subplots(figsize=(10,6))
    sns.barplot(x='revenue',y=vote_rev_d.index,data=vote_rev_d,palette='BuGn_r')
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.xlabel('Average Revenue',fontsize=15)
    plt.ylabel('Director',fontsize=15)
    plt.title('Revenue by vote above 8 Director',fontsize=20)
    plt.show()
[/code]

![png](https://img-blog.csdn.net/2018052313323079?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Now for the cast. The cast feature lists many people for each film; fortunately, it is ordered by importance, so those listed first can be treated as the lead actors.
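Since the ordering carries meaning, one way to use it is to keep only the first few names per film. The sketch below assumes the cast cell is the `str()` of a Python list, as stored during cleaning; the names are placeholders, and both the helper name `lead_actors` and the cut-off of three are illustrative choices rather than anything from the original analysis.

```python
import ast


def lead_actors(cast_repr, n=3):
    """Recover the list from its string form and keep the first n names."""
    return ast.literal_eval(cast_repr)[:n]


# Placeholder names standing in for a real cast cell.
cast_repr = "['Actor A', 'Actor B', 'Actor C', 'Actor D']"
leads = lead_actors(cast_repr)
nobody = lead_actors("[]")  # films with no cast information

print(leads)
print(nobody)
```

`ast.literal_eval` also copes with names that contain an apostrophe or a comma, which stripping quotes and splitting on ',' would break apart.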

```code
    fulldf['cast']=fulldf['cast'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
    fulldf['cast']=fulldf['cast'].str.split(',')

    list1=[]
    for i in fulldf['cast']:
        list1.extend(i)
    list1 = pd.Series(list1)
    list1 = list1.value_counts()[:15].sort_values(ascending=True)

    plt.subplots(figsize=(10,6))
    ax = list1.plot.barh(width=0.9,color='green')
    for i,v in enumerate(list1.values):
        ax.text(.8, i, v,fontsize=10,color='white',weight='bold')
    plt.title('Actors with highest appearance')
    ax.patches[14].set_facecolor('b')
    plt.show()
[/code]

![png](https://img-blog.csdn.net/2018052313325049?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

```code
    fulldf['keywords'][2]
[/code]

“[‘spy’, ‘based on novel’, ‘secret agent’, ‘sequel’, ‘mi6’, ‘british secret
service’, ‘united kingdom’]”

```code
    from wordcloud import WordCloud, STOPWORDS
    import nltk
    from nltk.corpus import stopwords
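    # If stopwords raises a "resource not found" error, run `import nltk; nltk.download()` in a Python session
    # and, in the window that opens, select Corpora > stopwords, then refresh and download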
    import io
    from PIL import Image
[/code]

```code
    plt.subplots(figsize=(12,12))
    stop_words=set(stopwords.words('english'))
    # Pass one list: update() with bare strings would add their individual characters
    stop_words.update([',',';','!','?','.','(',')','$','#','+',':','...',' ',''])
    
    img1 = Image.open('timg1.jpg')  # mask image that shapes the cloud
    hcmask1 = np.array(img1)
    words=fulldf['keywords'].dropna().apply(nltk.word_tokenize)
    word=[]
    for i in words:
        word.extend(i)
    word=pd.Series(word)
    word=([i for i in word.str.lower() if i not in stop_words])
    wc = WordCloud(background_color="black", max_words=4000, mask=hcmask1,
                   stopwords=STOPWORDS, max_font_size= 60)
    wc.generate(" ".join(word))
    
    plt.imshow(wc,interpolation="bilinear")
    plt.axis('off')
    plt.show()
[/code]

![png](https://img-blog.csdn.net/20180523133325401?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

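This gives a rough picture of the keywords: "woman director" and "independent film" take up a large share, which may also point to a trend in filmmaking.

##  Movie recommendation model

Building on the analysis above, we can put together a movie recommender. When people search for a movie, they usually look for films of the same genre, films by the same director or with the same actors, or highly rated films, so the features we need are genres, cast, director, and score.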

```code
    l[:5]
[/code]

[‘Action’, ‘Adventure’, ‘Fantasy’, ‘ScienceFiction’, ‘Crime’]

###  Feature vectorization

####  genre

```code
    def binary(genre_list):
        binaryList = []
    
        for genre in l:
            if genre in genre_list:
                binaryList.append(1)
            else:
                binaryList.append(0)
    
        return binaryList
[/code]

```code
    fulldf['genre_vec'] = fulldf['genres'].apply(lambda x: binary(x))
[/code]

```code
    fulldf['genre_vec'][0]
[/code]

[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
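The genre encoding above is a multi-hot vector: one slot per known genre, set to 1 when the movie has that genre. A minimal self-contained sketch of the same idea follows; the helper name `multi_hot` and the five-genre vocabulary are illustrative assumptions, not part of the original notebook.

```python
def multi_hot(items, vocabulary):
    """Return a 0/1 list with one slot per vocabulary entry."""
    present = set(items)  # set membership is O(1) per lookup
    return [1 if term in present else 0 for term in vocabulary]

# Hypothetical vocabulary, in the order the genres were first seen
vocabulary = ['Action', 'Adventure', 'Fantasy', 'ScienceFiction', 'Crime']

avatar = multi_hot(['Action', 'Adventure', 'Fantasy', 'ScienceFiction'], vocabulary)
spectre = multi_hot(['Action', 'Adventure', 'Crime'], vocabulary)
print(avatar)   # [1, 1, 1, 1, 0]
print(spectre)  # [1, 1, 0, 0, 1]
```

The same helper would work for cast, director, and keyword vocabularies, which avoids redefining a function called `binary` for each feature.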

####  cast

```code
    for i,j in zip(fulldf['cast'],fulldf.index):
        list2=[]
        list2=i[:4]
        list2.sort()
        fulldf.loc[j,'cast']=str(list2)
    fulldf['cast'][0]
[/code]

“[‘SamWorthington’, ‘SigourneyWeaver’, ‘StephenLang’, ‘ZoeSaldana’]”

```code
    fulldf['cast']=fulldf['cast'].str.strip('[]').str.replace(' ','').str.replace("'",'')
    fulldf['cast']=fulldf['cast'].str.split(',')
    fulldf['cast'][0]
[/code]

[‘SamWorthington’, ‘SigourneyWeaver’, ‘StephenLang’, ‘ZoeSaldana’]

```code
    castList = []
    for index, row in fulldf.iterrows():
        cast = row["cast"]
        for i in cast:
            if i not in castList:
                castList.append(i)
[/code]

```code
    len(castList)
[/code]

7515

```code
    def binary(cast_list):
        binaryList = []
    
        # One slot per known actor: 1 if the actor appears in this movie's cast
        for actor in castList:
            if actor in cast_list:
                binaryList.append(1)
            else:
                binaryList.append(0)
    
        return binaryList
[/code]

```code
    fulldf['cast_vec'] = fulldf['cast'].apply(lambda x:binary(x))
    fulldf['cast_vec'].head(2)
[/code]

0 [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  
1 [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …  
Name: cast_vec, dtype: object

####  director

```code
    fulldf['director'][0]
[/code]

‘James Cameron’

```code
    def xstr(s):
        # Missing directors may be None or NaN; map both to an empty string
        if pd.isna(s):
            return ''
        return str(s)
    fulldf['director']=fulldf['director'].apply(xstr)
[/code]

```code
    directorList=[]
    for i in fulldf['director']:
        if i not in directorList:
            directorList.append(i)
[/code]

```code
    def binary(director):
        binaryList = []
    
        for direct in directorList:
            # Compare whole names; `in` on a string would also match substrings
            if direct == director:
                binaryList.append(1)
            else:
                binaryList.append(0)
    
        return binaryList
[/code]

```code
    fulldf['director_vec'] = fulldf['director'].apply(lambda x:binary(x))
[/code]

####  keywords

```code
    fulldf['keywords'][0]
[/code]

“[‘culture clash’, ‘future’, ‘space war’, ‘space colony’, ‘society’, ‘space
travel’, ‘futuristic’, ‘romance’, ‘space’, ‘alien’, ‘tribe’, ‘alien planet’,
‘cgi’, ‘marine’, ‘soldier’, ‘battle’, ‘love affair’, ‘anti war’, ‘power
relations’, ‘mind and soul’, ‘3d’]”

```code
    #change keywords to type list
    fulldf['keywords']=fulldf['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
    fulldf['keywords']=fulldf['keywords'].str.split(',')
[/code]

```code
    for i,j in zip(fulldf['keywords'],fulldf.index):
        list2=[]
        list2 = i
        list2.sort()
        fulldf.loc[j,'keywords']=str(list2)
    fulldf['keywords'][0]
[/code]

“[‘3d’, ‘alien’, ‘alienplanet’, ‘antiwar’, ‘battle’, ‘cgi’, ‘cultureclash’,
‘future’, ‘futuristic’, ‘loveaffair’, ‘marine’, ‘mindandsoul’,
‘powerrelations’, ‘romance’, ‘society’, ‘soldier’, ‘space’, ‘spacecolony’,
‘spacetravel’, ‘spacewar’, ‘tribe’]”

```code
    fulldf['keywords']=fulldf['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
    fulldf['keywords']=fulldf['keywords'].str.split(',')
[/code]

```code
    words_list = []
    for index, row in fulldf.iterrows():
        genres = row["keywords"]
    
        for genre in genres:
            if genre not in words_list:
                words_list.append(genre)
    len(words_list)
[/code]

9772

```code
    def binary(words):
        binaryList = []
    
        for genre in words_list:
            if genre in words:
                binaryList.append(1)
            else:
                binaryList.append(0)
    
        return binaryList
[/code]

```code
    fulldf['words_vec'] = fulldf['keywords'].apply(lambda x: binary(x))
[/code]

####  recommend model

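Cosine distance serves as the similarity measure: we compute the distance between two movies from the selected feature vectors and recommend the 10 movies closest to the chosen one.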
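For reference, the cosine distance used below is one minus the cosine of the angle between two vectors, which is what `scipy.spatial.distance.cosine` computes. A minimal pure-Python sketch, with `cosine_distance` as a hypothetical helper name and made-up four-slot vectors:

```python
import math

def cosine_distance(a, b):
    """1 - cos(angle between a and b); 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1 - dot / (norm_a * norm_b)

same = cosine_distance([1, 1, 0, 0], [1, 1, 0, 0])      # identical genre sets
half = cosine_distance([1, 1, 0, 0], [1, 0, 1, 0])      # one of two genres shared
disjoint = cosine_distance([1, 1, 0, 0], [0, 0, 1, 1])  # nothing shared
print(same, half, disjoint)
```

Because the 0/1 feature vectors have no negative entries, each distance falls between 0 and 1, so a sum over four features ranges from 0 (identical on every feature) to 4 (nothing in common).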

```code
    fulldf=fulldf[(fulldf['vote_average']!=0)] #remove movies with a score of 0 or without a director name
    fulldf=fulldf[fulldf['director']!='']
[/code]

```code
    from scipy import spatial
    
    def Similarity(movieId1, movieId2):
        a = fulldf.iloc[movieId1]
        b = fulldf.iloc[movieId2]
    
        genresA = a['genre_vec']
        genresB = b['genre_vec']
        genreDistance = spatial.distance.cosine(genresA, genresB)
    
        castA = a['cast_vec']
        castB = b['cast_vec']
        castDistance = spatial.distance.cosine(castA, castB)
    
        directA = a['director_vec']
        directB = b['director_vec']
        directDistance = spatial.distance.cosine(directA, directB)
    
        wordsA = a['words_vec']
        wordsB = b['words_vec']
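        wordsDistance = spatial.distance.cosine(wordsA, wordsB)  # keyword vectors, not the director vectors
        # Sum of four cosine distances: smaller means more similar
        return genreDistance + directDistance + castDistance + wordsDistance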
[/code]

```code
    Similarity(3,160)
[/code]

2.7958758547680684

```code
    columns =['original_title','genres','vote_average','genre_vec','cast_vec','director','director_vec','words_vec']
    tmp = fulldf.copy()
    tmp =tmp[columns]
    tmp['id'] = list(range(0,fulldf.shape[0]))
    tmp.head()
[/code]

  |  original_title  |  genres  |  vote_average  |  genre_vec  |  cast_vec  |  director  |  director_vec  |  words_vec  |  id  
---|---|---|---|---|---|---|---|---|---  
0  |  Avatar  |  [Action, Adventure, Fantasy, ScienceFiction]  |  7.2  |  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  James Cameron  |  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …  |  0  
1  |  Pirates of the Caribbean: At World’s End  |  [Adventure, Fantasy, Action]  |  6.9  |  [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …  |  Gore Verbinski  |  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  1  
2  |  Spectre  |  [Action, Adventure, Crime]  |  6.3  |  [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, …  |  Sam Mendes  |  [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  2  
3  |  The Dark Knight Rises  |  [Action, Crime, Drama, Thriller]  |  7.6  |  [1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, …  |  Christopher Nolan  |  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  3  
4  |  John Carter  |  [Action, Adventure, ScienceFiction]  |  6.1  |  [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  Andrew Stanton  |  [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …  |  4

```code
    tmp.isnull().sum()
[/code]  
  
original_title 0  
genres 0  
vote_average 0  
genre_vec 0  
cast_vec 0  
director 0  
director_vec 0  
words_vec 0  
id 0  
dtype: int64

```code
    import operator
    def recommend(name):
        # Use the first movie whose title contains `name`
        film=tmp[tmp['original_title'].str.contains(name)].iloc[0].to_frame().T
        print('Selected Movie: ',film.original_title.values[0])
        def getNeighbors(baseMovie):
            base_id = baseMovie['id'].values[0]
            distances = []
            for index, movie in tmp.iterrows():
                if movie['id'] != base_id:
                    dist = Similarity(base_id, movie['id'])
                    distances.append((movie['id'], dist))
    
            # Smallest total distance first, then keep the 10 nearest
            distances.sort(key=operator.itemgetter(1))
            return distances[:10]
        neighbors = getNeighbors(film)
        print('\nRecommended Movies: \n')
    
        for nei in neighbors:
            row = tmp.iloc[nei[0]]  # look columns up by name, not by position
            print( row['original_title']+" | Genres: "+
                  str(row['genres']).strip('[]').replace(' ','')+" | Rating: "
                  +str(row['vote_average']))
    
        print('\n')
[/code]
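Sorting every distance just to keep the ten smallest works, but a heap-based selection does the same job without a full sort. The sketch below uses the standard library's `heapq.nsmallest`; the helper name `nearest` and the `(movie_id, distance)` pairs are made up for illustration.

```python
import heapq

def nearest(distances, k=10):
    """Return the k (movie_id, distance) pairs with the smallest distance."""
    return heapq.nsmallest(k, distances, key=lambda pair: pair[1])

# Hypothetical distances from one base movie to five others
distances = [(0, 2.8), (1, 0.4), (2, 3.1), (3, 1.2), (4, 0.9)]
top3 = nearest(distances, k=3)
print(top3)  # [(1, 0.4), (4, 0.9), (3, 1.2)]
```

With only a few thousand movies the difference is small; the per-pair cosine computation dominates the running time either way.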

```code
    recommend('Godfather')
[/code]

Selected Movie: The Godfather: Part III

Recommended Movies:

```code
    The Godfather: Part II | Genres: 'Drama','Crime' | Rating: 8.3
    The Godfather | Genres: 'Drama','Crime' | Rating: 8.4
    The Rainmaker | Genres: 'Drama','Crime','Thriller' | Rating: 6.7
    The Outsiders | Genres: 'Crime','Drama' | Rating: 6.9
    The Conversation | Genres: 'Crime','Drama','Mystery' | Rating: 7.5
    The Cotton Club | Genres: 'Music','Drama','Crime','Romance' | Rating: 6.6
    Apocalypse Now | Genres: 'Drama','War' | Rating: 8.0
    Twixt | Genres: 'Horror','Thriller' | Rating: 5.0
    New York Stories | Genres: 'Comedy','Drama','Romance' | Rating: 6.2
    Peggy Sue Got Married | Genres: 'Comedy','Drama','Fantasy','Romance' | Rating: 5.9
[/code]

#  Function reference

##  JSON handling

JSON is a data-interchange format built from key-value pairs. Its values can be strings, numbers, booleans, null, arrays, or nested objects.

  * `json.loads()` decodes a JSON string into Python objects (a JSON object becomes a dict). 
  * `json.dumps()` is the inverse: it serializes a Python object to a JSON string. To store the result in a file, convert with `dumps()` first and then write the string. 
  * `json.dump(obj, file)` serializes a Python object and writes it straight to an open file. 
  * `json.load(file)` reads JSON data from an open file. 

```code
    import json
    exam = {'a':'1111','b':'2222','c':'3333','d':'4444'}
    file = 'exam.json'
    jsobj = json.dumps(exam)
    # solution 1: serialize to a string, then write it
    with open(file,'w') as f:
        f.write(jsobj)  # the with block closes the file on exit
    # solution 2: let json.dump write to the file directly
    with open(file,'w') as f:
        json.dump(exam,f)
[/code]
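Reading the data back completes the round trip. A minimal sketch, assuming a temporary directory is an acceptable place for the file (the `exam.json` name and the `exam` dict are reused from the example above):

```python
import json
import os
import tempfile

exam = {'a': '1111', 'b': '2222', 'c': '3333', 'd': '4444'}

# String round trip: dumps() -> loads()
text = json.dumps(exam)
from_text = json.loads(text)

# File round trip: dump() -> load()
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'exam.json')
    with open(path, 'w') as f:
        json.dump(exam, f)
    with open(path) as f:
        from_file = json.load(f)

print(type(text).__name__)                   # str
print(from_text == exam, from_file == exam)  # True True
```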

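##  zip()

  * `zip()` takes iterables as arguments and packs their corresponding elements into tuples. In Python 3 it returns a lazy iterator over those tuples (wrap it in `list()` to see them), and iteration stops at the shortest input. 
  * Its inverse is `zip(*zipped)`, which unpacks the tuples back into the original sequences. For example: 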

```code
    a = [1,2,3]
    b = [4,5,6]
    c = [4,5,6,7,8]
    zipped = zip(a,b)
    for i in zipped:
        print(i)
    print('\n')
    shor_z = zip(a,c)
    for j in shor_z:  # stops at the shortest input
        print(j)
[/code]

(1, 4)  
(2, 5)  
(3, 6)  

(1, 4)  
(2, 5)  
(3, 6)

```code
    z=list(zip(a,b))
    z
[/code]

[(1, 4), (2, 5), (3, 6)]

```code
    list(zip(*z))  # wrap in list() to see the result
[/code]

[(1, 2, 3), (4, 5, 6)]
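When the shorter input should not cut the result off, the standard library offers `itertools.zip_longest`, which pads the missing positions instead. A short sketch reusing the lists above (the `fillvalue` of `None` is just one possible choice):

```python
from itertools import zip_longest

a = [1, 2, 3]
c = [4, 5, 6, 7, 8]

truncated = list(zip(a, c))                       # stops after the 3rd pair
padded = list(zip_longest(a, c, fillvalue=None))  # keeps all 5 positions
print(truncated)  # [(1, 4), (2, 5), (3, 6)]
print(padded)     # [(1, 4), (2, 5), (3, 6), (None, 7), (None, 8)]

# Unzipping: zip(*pairs) turns the pairs back into one tuple per original input
first, second = zip(*truncated)
print(first, second)  # (1, 2, 3) (4, 5, 6)
```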

##  pandas merge/rename

`pd.merge()` joins two DataFrames on key columns.

```code
    a=pd.DataFrame({'lkey':['foo','foo','bar','bar'],'value':[1,2,3,4]})
    a
[/code]

  |  lkey  |  value  
---|---|---  
0  |  foo  |  1  
1  |  foo  |  2  
2  |  bar  |  3  
3  |  bar  |  4

```code
    for index,row in a.iterrows():
        print(index)
        print('*****')
        print(row)
[/code]  
  
0  
*****  
lkey foo  
value 1  
Name: 0, dtype: object  
1  
*****  
lkey foo  
value 2  
Name: 1, dtype: object  
2  
*****  
lkey bar  
value 3  
Name: 2, dtype: object  
3  
*****  
lkey bar  
value 4  
Name: 3, dtype: object

```code
    b=pd.DataFrame({'rkey':['foo','foo','bar','bar'],'value':[5,6,7,8]})
    b
[/code]

  |  rkey  |  value  
---|---|---  
0  |  foo  |  5  
1  |  foo  |  6  
2  |  bar  |  7  
3  |  bar  |  8

```code
    pd.merge(a,b,left_on='lkey',right_on='rkey',how='left')
[/code]  
  
  |  lkey  |  value_x  |  rkey  |  value_y  
---|---|---|---|---  
0  |  foo  |  1  |  foo  |  5  
1  |  foo  |  1  |  foo  |  6  
2  |  foo  |  2  |  foo  |  5  
3  |  foo  |  2  |  foo  |  6  
4  |  bar  |  3  |  bar  |  7  
5  |  bar  |  3  |  bar  |  8  
6  |  bar  |  4  |  bar  |  7  
7  |  bar  |  4  |  bar  |  8

```code
    pd.merge(a,b,left_on='lkey',right_on='rkey',how='inner')
[/code]  
  
  |  lkey  |  value_x  |  rkey  |  value_y  
---|---|---|---|---  
0  |  foo  |  1  |  foo  |  5  
1  |  foo  |  1  |  foo  |  6  
2  |  foo  |  2  |  foo  |  5  
3  |  foo  |  2  |  foo  |  6  
4  |  bar  |  3  |  bar  |  7  
5  |  bar  |  3  |  bar  |  8  
6  |  bar  |  4  |  bar  |  7  
7  |  bar  |  4  |  bar  |  8  
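
In the two results above, `how='left'` and `how='inner'` look identical because every key in `a` also appears in `b`. The sketch below adds an unmatched key to show the difference; the frames `left` and `right` and the key `'baz'` are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'lkey': ['foo', 'bar', 'baz'], 'value': [1, 2, 3]})
right = pd.DataFrame({'rkey': ['foo', 'bar'], 'value': [5, 6]})

# Left join keeps every row of `left`; unmatched rows get NaN on the right side
left_join = pd.merge(left, right, left_on='lkey', right_on='rkey', how='left')
# Inner join keeps only keys present in both frames
inner_join = pd.merge(left, right, left_on='lkey', right_on='rkey', how='inner')

print(left_join)
print(inner_join)
print(len(left_join), len(inner_join))  # 3 2
```

As in the tables above, the overlapping `value` column is split into `value_x` (from the left frame) and `value_y` (from the right frame).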
  
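`DataFrame.rename()` renames row or column labels. It returns a new DataFrame and leaves the original unchanged unless `inplace=True` is passed.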

```code
    dframe= pd.DataFrame(np.arange(12).reshape((3, 4)),
                     index=['NY', 'LA', 'SF'],
                     columns=['A', 'B', 'C', 'D'])
    dframe
[/code]

  |  A  |  B  |  C  |  D  
---|---|---|---|---  
NY  |  0  |  1  |  2  |  3  
LA  |  4  |  5  |  6  |  7  
SF  |  8  |  9  |  10  |  11

```code
    dframe.rename(columns={'A':'alpha'})
[/code]  
  
  |  alpha  |  B  |  C  |  D  
---|---|---|---|---  
NY  |  0  |  1  |  2  |  3  
LA  |  4  |  5  |  6  |  7  
SF  |  8  |  9  |  10  |  11

```code
    dframe
[/code]  
  
  |  A  |  B  |  C  |  D  
---|---|---|---|---  
NY  |  0  |  1  |  2  |  3  
LA  |  4  |  5  |  6  |  7  
SF  |  8  |  9  |  10  |  11

```code
    dframe.rename(columns={'A':'alpha'},inplace=True)
    dframe
[/code]  
  
  |  alpha  |  B  |  C  |  D  
---|---|---|---|---  
NY  |  0  |  1  |  2  |  3  
LA  |  4  |  5  |  6  |  7  
SF  |  8  |  9  |  10  |  11  
  
##  pandas datetime format

`pd.to_datetime()` converts values to the pandas datetime format.
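
A short sketch of the conversion, assuming a `release_date` column of `YYYY-MM-DD` strings like the one in the TMDB data; the sample rows and the `errors='coerce'` choice are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'release_date': ['2009-12-10', '2007-05-19', None]})

# errors='coerce' turns unparseable or missing values into NaT instead of raising
df['release_date'] = pd.to_datetime(df['release_date'], format='%Y-%m-%d', errors='coerce')
# Once the column is datetime, the .dt accessor exposes year, month, and so on
df['year'] = df['release_date'].dt.year

print(df)
```

Extracting the year this way is what makes grouping movies by release year straightforward.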

##  Wordcloud

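The `wordcloud` module:  
1. Install: in a conda prompt, run `conda install -c conda-forge wordcloud`  
2. Steps: read the background (mask) image and the text, instantiate a `WordCloud` object `wc`,  
call `wc.generate(text)` to build the cloud, and display it with `plt.imshow()`. Parameters:  
`mask`: mask image; words are drawn only inside the shape it defines  
`background_color`: background color, black by default  
`max_font_size`: largest font size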

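##  A brief introduction to nltk

`from nltk.corpus import stopwords`  
If `stopwords` raises a "resource not found" error, run `import nltk; nltk.download()` in a Python session  
and, in the window that opens, select Corpora > stopwords, then refresh and download.  
Likewise, on the Models tab select Punkt Tokenizer Models, refresh and download; this enables tokenization with `nltk.word_tokenize()`:  
`nltk.sent_tokenize(text)` # split text into sentences

`nltk.word_tokenize(sent)` # split a sentence into words

stopwords: as I understand it, words that occur in large numbers, add nothing to the meaning, and can simply be filtered out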
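
To make the idea concrete, here is a minimal sketch of stopword filtering with only the standard library. The tiny `STOP` set, the whitespace tokenizer, and the sample sentence are stand-ins chosen for illustration; nltk's `stopwords.words('english')` and `word_tokenize` are the tools used earlier in this post.

```python
from collections import Counter

# Hypothetical mini stopword list; nltk ships a much longer one
STOP = {'the', 'of', 'a', 'and', 'in', 'on', 'to'}

def keep_content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOP]

tokens = keep_content_words("Enter the World of Pandora and explore the world")
counts = Counter(tokens)
print(tokens)                 # ['enter', 'world', 'pandora', 'explore', 'world']
print(counts.most_common(1))  # [('world', 2)]
```

Word frequencies like `counts` are what a word cloud turns into font sizes.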

#  References:

[ what’s my score ](https://www.kaggle.com/ash316/what-s-my-score)  
[ TMDB means per genre ](https://www.kaggle.com/kkooijman/tmdb-means-per-genre)

* * *

_I am still learning; feedback is welcome!_

