机器学习训练营——机器学习爱好者的自由交流空间(入群联系qq:2279055353)
2018年电影工业蓬勃发展,全球电影票房估值超过400亿美元。哪些电影最赚钱?本案例使用来自The Movie Database的超过7,000部已公映电影的元数据,预测单部电影的全球票房收入。
数据集包括7,398部电影的元数据,每部电影由id
确定。数据点包括cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries
.
你的任务是预测检验集里的4,398部电影的全球票房。注意,由于有些电影被翻拍后上映,因此数据集里可能存在一部电影的多个实例。然而,它们实际上是不同的电影。
import numpy as np
import pandas as pd
pd.set_option('max_columns', None)
import matplotlib.pyplot as plt
import seaborn as sns
import ast
plt.style.use('ggplot')
import datetime
import lightgbm as lgb
from scipy import stats
from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import train_test_split, KFold
from wordcloud import WordCloud
from collections import Counter
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import StandardScaler
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
dict_columns = ['belongs_to_collection', 'genres', 'production_companies',
'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew']
def text_to_dict(df):
for column in dict_columns:
df[column] = df[column].apply(lambda x: {} if pd.isna(x) else ast.literal_eval(x) )
return df
train = text_to_dict(train)
test = text_to_dict(test)
print(train.head())
print(train.shape)
print(test.shape)
(3000, 23)
(4398, 22)
在训练集有3000个样本,一些列变量包括字典列表。现在,我们从这些列提取数据。
for i, e in enumerate(train['belongs_to_collection'][:5]):
print(i, e)
train['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0).value_counts()
在belongs_to_collection
里,有2396个空值,604个包含信息。
train['collection_name'] = train['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
train['has_collection'] = train['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0)
test['collection_name'] = test['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
test['has_collection'] = test['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0)
train = train.drop(['belongs_to_collection'], axis=1)
test = test.drop(['belongs_to_collection'], axis=1)
for i, e in enumerate(train['genres'][:5]):
print(i, e)
print('Number of genres in films')
train['genres'].apply(lambda x: len(x) if x != {} else 0).value_counts()
现在,我们看一看genres
, 画出它的词云。
list_of_genres = list(train['genres'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
plt.figure(figsize = (12, 8))
text = ' '.join([i for j in list_of_genres for i in j])
wordcloud = WordCloud(max_font_size=None, background_color='white', collocations=False,
width=1200, height=1000).generate(text)
plt.imshow(wordcloud)
plt.title('Top genres')
plt.axis("off")
plt.show()
Drama
, Comedy
and Thriller
是流行的genres.
有很多描述电影的关键词,我们用词云查找最普遍的词。
list_of_keywords = list(train['Keywords'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
plt.figure(figsize = (16, 12))
text = ' '.join(['_'.join(i.split(' ')) for j in list_of_keywords for i in j])
wordcloud = WordCloud(max_font_size=None, background_color='black', collocations=False,
width=1200, height=1000).generate(text)
plt.imshow(wordcloud)
plt.title('Top keywords')
plt.axis("off")
plt.show()