Get the movie-ratings data from the official Kaggle page: https://www.kaggle.com/rounakbanik/the-movies-dataset
Open it in a browser; registration is required before downloading. The page looks like this:
(Figure: screenshot of the Kaggle dataset page)
The page documents the dataset.
Clicking Download on the far right requires a Kaggle account. If the CAPTCHA fails to display during registration, see this workaround: https://www.cnblogs.com/liuxiaomin/p/11785645.html
(Figure: the file list shown on the dataset's download page)
There are eight files in total, all in CSV format. They can be opened directly in Excel, or the first five rows can be read with Python to inspect the format and basic information. This article uses the data in movies_metadata.csv for prediction.
import pandas as pd

df = pd.read_csv(r"movies_metadata.csv", engine='python')  # read the file
print("df.shape: {}".format(df.shape))  # number of rows and columns
df.info()  # per-column dtypes and non-null counts (info() prints itself and returns None)
print(df.head())  # first 5 rows
Part of the file opened in Excel:
(Figure: movies_metadata.csv in Excel, with the column types discussed below highlighted)
1. The vote_average column (purple box in the figure) is the rating, i.e. the prediction target. It is already numeric and needs little processing.
2. The second type, e.g. original_language (yellow box), holds a small set of strings that must be converted to numbers via one-hot encoding.
3. The third type, e.g. imdb_id (yellow box), needs cleaning so that only the digits are kept.
4. The fourth type is the hardest to process, e.g. spoken_languages (red box): the cell looks like a string but is really a list whose elements are dicts. The values must be pulled out of the dicts, and the resulting list is then one-hot encoded as well.
5. The fifth type is free text that differs in almost every row, e.g. the title column; such columns are simply dropped.
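As an aside on type 4: Python's `ast.literal_eval` parses these stringified lists more safely than `eval`, since it only accepts literals. A minimal sketch with a made-up cell value in the shape of the spoken_languages column:

```python
import ast

# A made-up cell value shaped like the spoken_languages column
cell = "[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Francais'}]"

def parse_listdict(cell, key):
    """Parse a stringified list of dicts and collect the values under key."""
    try:
        items = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # malformed cell ('nan', truncated text, ...) -> empty
    if not isinstance(items, list):
        return []
    return [d[key] for d in items if isinstance(d, dict) and key in d]

print(parse_listdict(cell, 'iso_639_1'))  # ['en', 'fr']
```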
Following this plan, the code below processes the first 10,000 rows:
# -*- coding:utf-8 -*-
import pandas as pd
import time
from pandas import DataFrame
from pandas.core.dtypes.inference import is_number
from sklearn.ensemble import RandomForestRegressor

path = r"movies_metadata.csv"
df = pd.read_csv(path, engine='python')

# Mappings handed to eval() so that the bare tokens 'nan' and 'FALSE'
# appearing in raw cells evaluate to 0 instead of raising NameError
globals1 = {
    'nan': 0
}
globals2 = {
    'FALSE': 0
}
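These mappings rely on how eval resolves names: any unknown identifier in the evaluated string is looked up in the globals/locals dicts passed in. A tiny demonstration:

```python
globals1 = {'nan': 0}
globals2 = {'FALSE': 0}

# Unknown names in the evaluated string are looked up in these mappings,
# so the raw tokens evaluate to 0 instead of raising NameError
print(eval('nan', globals1, globals2))    # 0
print(eval('FALSE', globals1, globals2))  # 0
```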
# Parse a column whose cells are stringified lists of dicts and
# return the distinct values found under keyValue
def readKinds(columName, keyValue):
    resultList = []
    for j in range(0, len(df)):
        df2 = df.iloc[j,]
        col = str(df2[columName])
        ccol = eval(col, globals1, globals2)
        if type(ccol) != int and type(ccol) != float:
            for aa in ccol:
                if aa[keyValue] not in resultList:
                    resultList.append(aa[keyValue])
    return resultList
# Parse a column whose cells are plain strings drawn from a small set
# and return the distinct values as a list
def readKind(columName):
    resultList = []
    for j in range(0, len(df)):
        df2 = df.iloc[j,]
        col = df2[columName]
        if col not in resultList:
            resultList.append(col)
    return resultList
list1 = readKinds('genres', 'id')                        # distinct genre ids
list3 = readKind('original_language')                    # distinct original languages
list4 = readKinds('production_companies', 'id')          # distinct company ids
list5 = readKinds('production_countries', 'iso_3166_1')  # distinct country codes
list6 = readKinds('spoken_languages', 'iso_639_1')       # distinct spoken-language codes
# Read the columns that need no list parsing, convert their values
# to numbers, and return them as a DataFrame
def readData1(paths):
    df = pd.read_csv(paths, engine='python')
    columns = ('adult', 'belongs_to_collection', 'budget',
               'id', 'imdb_id', 'popularity', 'release_year',
               'release_month', 'release_day', 'revenue',
               'runtime', 'status', 'video', 'vote_count', 'vote_average')
    rows = []
    for i in range(0, 10000):
        df2 = df.iloc[i,]
        # adult: boolean string -> 0/1
        adult_new = 0 if str(df2['adult']) == 'FALSE' else 1
        # belongs_to_collection: stringified dict -> its 'id', else 0
        belongs_id = eval(str(df2['belongs_to_collection']), globals1, globals2)
        if type(belongs_id) not in (int, float, str) and is_number(belongs_id['id']):
            belongs_to_collection_id = belongs_id['id']
        else:
            belongs_to_collection_id = 0
        budget = df2['budget'] if is_number(df2['budget']) else 0
        id = df2['id'] if is_number(df2['id']) else 0
        # imdb_id: strip the leading 'tt' prefix and keep the digits
        imdb_ids = str(df2['imdb_id'])[2:]
        imdb_id = imdb_ids if is_number(imdb_ids) else 0
        popularity = df2['popularity'] if is_number(df2['popularity']) else 0
        # release_date: split into year/month/day; a few cells hold
        # malformed values ('1', '12', '22') and are zeroed out
        release_date = str(df2['release_date']).replace('-', '/')
        if release_date != 'nan' and release_date not in ['1', '12', '22']:
            tt = time.strptime(release_date, "%Y/%m/%d")
            year, mon, day = tt.tm_year, tt.tm_mon, tt.tm_mday
        else:
            year, mon, day = 0, 0, 0
        revenue = df2['revenue'] if is_number(df2['revenue']) else 0
        runtime = df2['runtime'] if is_number(df2['runtime']) else 0
        status_new = 1 if df2['status'] == 'Released' else 0
        video_new = 0 if df2['video'] == 'FALSE' else 1
        vote_count = df2['vote_count'] if is_number(df2['vote_count']) else 0
        vote_average = df2['vote_average'] if is_number(df2['vote_average']) else 0
        # Collect one row per movie; building the frame once at the end
        # avoids the deprecated row-by-row DataFrame.append
        rows.append({'adult': adult_new, 'belongs_to_collection': belongs_to_collection_id,
                     'budget': budget, 'id': id, 'imdb_id': imdb_id, 'popularity': popularity,
                     'release_year': year, 'release_month': mon, 'release_day': day,
                     'revenue': revenue, 'runtime': runtime, 'status': status_new,
                     'video': video_new, 'vote_count': vote_count, 'vote_average': vote_average})
    return pd.DataFrame(rows, columns=columns)
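As an alternative to the manual time.strptime branch (with its hard-coded list of malformed date strings), pandas' to_datetime with errors='coerce' turns every cell that does not match the format into NaT in one pass; a sketch on toy values:

```python
import pandas as pd

# Toy release_date values, including malformed cells like those in the raw file
dates = pd.Series(['1995-10-30', '12', None, '2001-07-04'])
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')  # bad cells -> NaT

year = parsed.dt.year.fillna(0).astype(int)
mon = parsed.dt.month.fillna(0).astype(int)
day = parsed.dt.day.fillna(0).astype(int)
print(list(year))  # [1995, 0, 0, 2001]
```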
# Turn one stringified-list cell into a one-hot vector over kindlist
def changeData(name, findValue, kindlist):
    names = eval(name, globals1, globals2)
    list2 = []
    name_value = []
    if type(names) == list:
        for ll in names:
            list2.append(ll[findValue])
    # cells that are not lists (e.g. nan) yield an all-zero vector
    for aa in kindlist:
        if aa in list2:
            name_value.append(1)
        else:
            name_value.append(0)
    return name_value
# Convert a list of one-hot rows (list of lists) into a DataFrame
# with generated column names name[0], name[1], ...
def creatDataFrame(original_list, name):
    name_list = []
    for i in range(len(original_list[0])):
        name_list.append(name + '[' + str(i) + ']')
    return DataFrame(original_list, columns=name_list)
# Read the columns whose cells are stringified lists and return one
# one-hot DataFrame per column
def readData2(paths):
    df = pd.read_csv(paths, engine='python')
    genres_list = []
    original_language_list = []
    companys_list = []
    countrys_list = []
    languages_list = []
    for i in range(0, 10000):
        df2 = df.iloc[i,]
        genre_list = changeData(str(df2['genres']), 'id', list1)
        # original_language is a plain string, so one-hot it against list3
        original_language = str(df2['original_language'])
        original_language_value = []
        for aa in list3:
            if aa == original_language:
                original_language_value.append(1)
            else:
                original_language_value.append(0)
        company_list = changeData(str(df2['production_companies']), 'id', list4)
        country_list = changeData(str(df2['production_countries']), 'iso_3166_1', list5)
        language_list = changeData(str(df2['spoken_languages']), 'iso_639_1', list6)
        genres_list.append(genre_list)
        original_language_list.append(original_language_value)
        companys_list.append(company_list)
        countrys_list.append(country_list)
        languages_list.append(language_list)
    genres_frame = creatDataFrame(genres_list, 'genres')
    languages_frame = creatDataFrame(original_language_list, 'languages')
    companys_frame = creatDataFrame(companys_list, 'companys')
    countrys_frame = creatDataFrame(countrys_list, 'countrys')
    spokens_frame = creatDataFrame(languages_list, 'spokens')
    return genres_frame, languages_frame, companys_frame, countrys_frame, spokens_frame
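The one-hot vectors that readKinds/changeData build by hand are what scikit-learn's MultiLabelBinarizer produces directly; a sketch on made-up genre-id lists:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each row is the list of genre ids extracted from one movie (made-up values)
rows = [[16, 35], [35], [18, 16, 10749]]

mlb = MultiLabelBinarizer()
onehot = mlb.fit_transform(rows)  # shape (n_movies, n_distinct_ids)
print(list(mlb.classes_))  # sorted distinct ids: [16, 18, 35, 10749]
print(onehot)
```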
# Main routine
def main():
    # Read and concatenate the processed DataFrames into one
    (genres_frame, languages_frame, companys_frame, countrys_frame, spokens_frame) = readData2(path)
    # newframe is the full processed feature matrix
    newframe = pd.concat(
        [genres_frame, languages_frame, companys_frame, countrys_frame, spokens_frame, readData1(path)], axis=1)
Handling missing values:
1. A column with many missing values and little influence on the rating is dropped outright; this dataset has no such column at the moment.
2. A column missing only a few values (under about 2%) is filled with 0 or the mean, or the affected rows are dropped; here I fill with 0.
    newframe = newframe.fillna(0)  # fill missing values with 0
Duplicate rows are dropped directly; remember that after any row deletion the index must be rebuilt:
    newframe.drop_duplicates(inplace=True)  # drop duplicate rows
    newframe.index = range(newframe.shape[0])  # rebuild the index after dropping rows
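The two lines above are often written as one chained call: reset_index(drop=True) rebuilds a contiguous 0..n-1 index without keeping the old index as a column. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 3, 4]})
deduped = df.drop_duplicates().reset_index(drop=True)  # drop dups, rebuild index
print(deduped.index.tolist())  # [0, 1]
```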
Check each column's mean, variance, minimum, maximum, and so on:
    print(newframe.describe([0.01, .99]).T)  # per-column distribution, including the 1% and 99% percentiles
The output is shown below:
(Figure: output of describe() for the processed columns)
Rows with runtime equal to 720 look wrong, so they are removed for now.
Rows with vote_average equal to 10 also look wrong (a movie rarely scores a perfect 10), so they are removed too.
    newframe = newframe[newframe.loc[:, "runtime"] < 720]  # drop rows with runtime >= 720
    newframe = newframe[newframe.loc[:, "vote_average"] < 10]  # drop rows with vote_average >= 10
    newframe.index = range(newframe.shape[0])  # rebuild the index
After processing, the data has 24,126 columns, so variance filtering is applied to drop features whose own variance is tiny. A low-variance feature barely differs across samples (most of its values are identical), so it does little to distinguish them; removing the zero-variance features also shrinks the data considerably.
The code:
    data_x = newframe.iloc[:, newframe.columns != "vote_average"]  # isolate the features
    data_y = (newframe.loc[:, "vote_average"]).astype("float")  # isolate the label
    from sklearn.feature_selection import VarianceThreshold  # variance filtering acts on features
    selector = VarianceThreshold()  # default threshold 0: drop zero-variance features
    X = selector.fit_transform(data_x)  # new feature matrix with failing features removed
    print(X.shape)  # X is a NumPy array, so it has .shape but no .columns
Save the result to a new CSV file:
    data = pd.concat([pd.DataFrame(X), pd.DataFrame(data_y)], axis=1)  # rejoin filtered features and label
    data.to_csv(r"movies_data_new.csv", sep=',', header=True, index=False)
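Note that fit_transform returns a plain NumPy array, so the original column names are lost; the selector's get_support() mask maps the surviving columns back to their names. A sketch on a toy frame:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({'constant': [1, 1, 1], 'varies': [1, 2, 3]})
selector = VarianceThreshold()  # default threshold 0.0: drop zero-variance columns
X = selector.fit_transform(toy)

kept = list(toy.columns[selector.get_support()])  # names of surviving columns
print(kept)     # ['varies']
print(X.shape)  # (3, 1)
```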
The best result comes from the random forest, which scores 0.72 without any tuning.
Decision tree, score: 0.23
Code:
# -*- coding:utf-8 -*-
import pandas as pd
from sklearn.model_selection import train_test_split

# Decision-tree model, score 0.23
def decisionTree(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn import tree
    clf = tree.DecisionTreeRegressor(
        # random_state=30  # fixing random_state keeps the score from changing between runs
        splitter='random'  # 'best' or 'random'; 'random' adds extra randomness
        , max_depth=10  # limit the tree depth to 10
        # , min_samples_leaf=10  # a leaf must contain more than 10 samples
        # , min_samples_split=20  # a node must hold more than 20 samples to split
    )  # step 1: instantiate
    clf = clf.fit(Xtrain, Ytrain)  # step 2: fit the model
    score = clf.score(Xtest, Ytest)  # step 3: score the model
    return score
def main():
    df = pd.read_csv(r"movies_data_new.csv", engine='python')  # read the processed data
    data_x = df.iloc[:, 0:-1]  # separate the features
    data_y = df.iloc[:, -1]  # separate the label
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(data_x, data_y, test_size=0.3, random_state=30)  # train/test split
    score = decisionTree(Xtrain, Ytrain, Xtest, Ytest)  # fit the model and get its score
    print(score)

if __name__ == '__main__':
    main()
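A single 70/30 split makes the score depend on random_state; k-fold cross-validation averages over several splits and gives a steadier estimate. A sketch using synthetic data in place of movies_data_new.csv:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the movie features
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

reg = DecisionTreeRegressor(max_depth=10, random_state=0)
scores = cross_val_score(reg, X, y, cv=5)  # one R^2 score per fold
print(scores.mean())
```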
Random forest, score: 0.72
Code:
# Random forest
def randomForest(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.ensemble import RandomForestRegressor
    # clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(Xtrain, Ytrain)
    clf = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)
    return score
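n_estimators and max_depth are the usual first knobs to tune for a random forest; GridSearchCV tries the combinations with cross-validation. A sketch on synthetic data (the grid values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)

# Small illustrative grid; real tuning would try more values
param_grid = {'n_estimators': [50, 100], 'max_depth': [5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```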
The continuous labels must be made discrete before this model will run.
Gaussian naive Bayes, score: 0.10
Code:
# Gaussian naive Bayes, score 0.10
def gaussianNB(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.naive_bayes import GaussianNB
    gnb = GaussianNB().fit(Xtrain, Ytrain)
    acc_score = gnb.score(Xtest, Ytest)
    return acc_score
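One way to make the continuous vote_average labels discrete, as required above, is pd.cut, which bins the ratings into labeled intervals; the bin edges below are hypothetical:

```python
import pandas as pd

votes = pd.Series([2.3, 5.1, 6.7, 8.9, 4.0])
# Bin the 0-10 ratings into four labeled classes (hypothetical edges)
labels = pd.cut(votes, bins=[0, 2.5, 5, 7.5, 10], labels=['low', 'mid', 'high', 'top'])
print(list(labels))  # ['low', 'high', 'high', 'top', 'mid']
```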
The continuous labels must be made discrete before this model will run.
Logistic regression, score: 0.38
Code:
# Logistic regression, score 0.38
def logisticRegression(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.linear_model import LogisticRegression
    logc = LogisticRegression().fit(Xtrain, Ytrain)
    score = logc.score(Xtest, Ytest)
    return score
K-nearest neighbors, score: 0.02
Code:
# K-nearest neighbors, score 0.02
def kNeighbors(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.neighbors import KNeighborsRegressor
    kn = KNeighborsRegressor(n_neighbors=3).fit(Xtrain, Ytrain)
    score = kn.score(Xtest, Ytest)
    return score
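KNN is distance-based, so unscaled features such as budget dwarf the 0/1 one-hot columns; standardizing inside a Pipeline usually helps. A sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale features to zero mean / unit variance before the distance computation
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
model.fit(Xtr, ytr)
print(model.score(Xte, yte))  # R^2 on the held-out split
```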
The continuous labels must be made discrete before this model will run.
Quadratic discriminant analysis, score: 0.11
Code:
# Quadratic discriminant analysis, score 0.11
def quadraticDiscriminantAnalysis(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    clf = QuadraticDiscriminantAnalysis().fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)
    return score
Linear discriminant analysis, score: 0.37
Code:
# Linear discriminant analysis, score 0.37
def linearDiscriminantAnalysis(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    clf = LinearDiscriminantAnalysis().fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)
    return score
Running MLPRegressor here gives a large negative score.
Multilayer perceptron, score: 0.34
Code:
# Multilayer perceptron
def mLP(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.neural_network import MLPClassifier
    clf = MLPClassifier(alpha=0).fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)
    return score
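The large negative MLPRegressor score mentioned above is typical for unscaled inputs; the scikit-learn documentation recommends standardizing features before fitting an MLP. A sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# Standardize inputs before the network; unscaled features often wreck MLP training
mlp = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000, random_state=0))
mlp.fit(Xtr, ytr)
print(mlp.score(Xte, yte))
```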