Movie Rating Prediction System Analysis

1. Data Acquisition

(1) Data source

The movie rating data comes from Kaggle: https://www.kaggle.com/rounakbanik/the-movies-dataset
Registration is required before downloading, and the dataset page documents each file.

(2) Problems encountered

Clicking Download on the right side of the page requires a Kaggle account; during registration the CAPTCHA may fail to display. A workaround is described here: https://www.cnblogs.com/liuxiaomin/p/11785645.html

(3) Inspecting the downloaded files

The archive contains eight CSV files. They can be opened directly in Excel, or inspected with Python by printing the first five rows and the column summary. This article uses movies_metadata.csv for the prediction.

import pandas as pd
df = pd.read_csv(r"movies_metadata.csv", engine='python')   # read the file
print("df.shape: {}".format(df.shape))  # number of rows and columns
df.info()               # dtype and non-null count of each column (info() prints directly and returns None)
print(df.head())        # first 5 rows

2. Data Preprocessing

(1) Converting the data to numeric types

Opening the file in Excel shows several kinds of columns:

1. The vote_average column holds the rating, i.e. the target we want to predict; it is already numeric and needs little processing.

2. Columns such as original_language hold one of a small set of strings; these are converted to numbers with one-hot encoding.

3. Columns such as imdb_id mix letters and digits; for these, only the digits are kept.

4. The hardest type is exemplified by spoken_languages: the cell looks like a string but is actually a list whose elements are dictionaries. The relevant value is extracted from each dictionary, and the list as a whole is then one-hot encoded.

5. Finally, free-text columns whose values are almost all distinct, such as title, are simply dropped.
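For the fourth type, `ast.literal_eval` is a safer way than `eval` to parse the stringified lists; a minimal sketch, where the cell value and the vocabulary of language codes are made up for illustration:

```python
import ast

# A hypothetical cell value from the spoken_languages column:
cell = "[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'French'}]"

parsed = ast.literal_eval(cell)           # safe parse of the stringified list
codes = [d['iso_639_1'] for d in parsed]  # pull out the key used for encoding

# One-hot encode against a fixed vocabulary of known codes
vocab = ['en', 'fr', 'es']
one_hot = [1 if c in codes else 0 for c in vocab]
print(one_hot)  # [1, 1, 0]
```

Unlike `eval`, `literal_eval` only accepts Python literals, so a malformed or malicious cell raises an error instead of executing code.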

Following this plan, the code below processes the first 10,000 rows:

# -*- coding:utf-8 -*-
import pandas as pd
import time
from pandas import DataFrame
from pandas.core.dtypes.inference import is_number
from sklearn.ensemble import RandomForestRegressor
path = r"movies_metadata.csv"
df = pd.read_csv(path, engine='python')
globals1 = {
        'nan': 0
    }
globals2 = {
        'FALSE': 0
    }   # map the bare names 'nan' and 'FALSE' to 0 so that eval() can parse the cell
# parse a cell that holds a stringified list of dicts; return the distinct values of keyValue
def readKinds(columName, keyValue):
    resultList = []
    for j in range(0, len(df)):
        df2 = df.iloc[j,]
        col = str(df2[columName])
        ccol = eval(col,globals1,globals2)
        if type(ccol) != int and type(ccol) != float:
            for aa in ccol:
                if aa[keyValue] not in resultList:
                    resultList.append(aa[keyValue])
    return resultList
# read a cell that holds one of a small set of strings; return the list of distinct strings
def readKind(columName):
    resultList = []
    for j in range(0, len(df)):
        df2 = df.iloc[j,]
        col = df2[columName]
        if col not in resultList:
            resultList.append(col)
    return resultList
list1 = readKinds('genres','id')  # distinct genre ids
list3 = readKind('original_language') # distinct original_language values
list4 = readKinds('production_companies','id')  # distinct production company ids
list5 = readKinds('production_countries','iso_3166_1')  # distinct production country codes
list6 = readKinds('spoken_languages','iso_639_1')  # distinct spoken language codes

# read the columns that need no list parsing, convert their values to numbers, return a DataFrame
def readData1(paths):
    df = pd.read_csv(paths, engine='python')
    # column names of the result frame
    resframe1 = pd.DataFrame(columns=('adult', 'belongs_to_collection', 'budget', \
                                      'id', 'imdb_id', 'popularity', 'release_year', \
                                      'release_mounth', 'release_day', 'revenue', \
                                      'runtime', 'status', 'video', 'vote_count', 'vote_average'))
    for i in range(0, 10000):
        df2 = df.iloc[i,]
        adult = str(df2['adult'])
        if adult == 'FALSE':
            adult_new = 0
        else:
            adult_new = 1

        belongs = str(df2['belongs_to_collection'])
        belongs_id = eval(belongs, globals1,globals2)
        if type(belongs_id) != int and type(belongs_id) != float and type(belongs_id) != str:
            belongs_to_collection = belongs_id['id']
            if is_number(belongs_to_collection):
                belongs_to_collection_id = belongs_id['id']
            else:
                belongs_to_collection_id = 0
        else:
            belongs_to_collection_id = 0

        if is_number(df2['budget']):
            budget = df2['budget']
        else:
            budget = 0

        if is_number(df2['id']):
            id = df2['id']
        else:
            id = 0

        imdb_ids = (str(df2['imdb_id']))[2:]
        if is_number(imdb_ids):
            imdb_id = imdb_ids
        else:
            imdb_id = 0

        if is_number(df2['popularity']):
            popularity = df2['popularity']
        else:
            popularity = 0

        release_date = str(df2['release_date'])
        release_date = release_date.replace('-', '/')
        if release_date != 'nan' and release_date not in ['1', '12', '22']:
            tt = time.strptime(release_date, "%Y/%m/%d")
            year = int(str(tt.tm_year))
            mon = int(str(tt.tm_mon))
            day = int(str(tt.tm_mday))
        else:
            year, mon, day = 0, 0, 0

        if is_number(df2['revenue']):
            revenue = df2['revenue']
        else:
            revenue = 0

        if is_number(df2['runtime']):
            runtime = df2['runtime']
        else:
            runtime = 0

        status = df2['status']
        if status == 'Released':
            status_new = 1
        else:
            status_new = 0
        video = df2['video']
        if video == 'FALSE':    # encode FALSE as 0, consistent with the adult column
            video_new = 0
        else:
            video_new = 1

        if is_number(df2['vote_count']):
            vote_count = df2['vote_count']
        else:
            vote_count = 0

        if is_number(df2['vote_average']):
            vote_average = df2['vote_average']
        else:
            vote_average = 0

        # append one processed row to resframe1
        resframe1 = resframe1.append([{'adult': adult_new, 'belongs_to_collection': belongs_to_collection_id, \
                                       'budget': budget, 'id': id, 'imdb_id': imdb_id, 'popularity': popularity, \
                                       'release_year': year, 'release_mounth': mon, 'release_day': day, \
                                       'revenue': revenue, 'runtime': runtime, 'status': status_new, \
                                       'video': video_new, 'vote_count': vote_count, 'vote_average': vote_average\
                                       }],ignore_index=True)
    return resframe1

# one-hot encode a stringified list against the kinds in kindlist; return a 0/1 list
def changeData(name,findValue,kindlist):
    names = eval(name, globals1, globals2)
    list2 = []
    name_value = []
    if type(names) == list:
        for ll in names:
            list2.append(ll[findValue])
    for aa in kindlist:
        if aa in list2:
            name_value.append(1)
        else:
            name_value.append(0)
    return name_value

# convert a list of 0/1 rows into a DataFrame with generated column names
def creatDataFrame(original_list,name):
    name_list = []
    for i in range(len(original_list[0])):
        aa = name + '[' + str(i) +']'
        name_list.append(aa)
    return DataFrame(original_list,columns=name_list)

# read the columns whose cells are stringified lists; return one DataFrame per column
def readData2(paths):
    df = pd.read_csv(paths, engine='python')
    genres_list = []
    original_language_list = []
    companys_list = []
    countrys_list = []
    languages_list = []
    for i in range(0, 10000):
        df2 = df.iloc[i,]
        genre = str(df2['genres'])
        genre_list = changeData(genre,'id',list1)

        original_language = str(df2['original_language'])
        original_language_value = []
        for aa in list3:
            if aa == original_language:
                original_language_value.append(1)
            else:
                original_language_value.append(0)

        production_companies = str(df2['production_companies'])
        company_list = changeData(production_companies,'id', list4)

        production_countries = str(df2['production_countries'])
        country_list = changeData(production_countries, 'iso_3166_1', list5)

        spoken_languages = str(df2['spoken_languages'])
        language_list = changeData(spoken_languages, 'iso_639_1', list6)

        genres_list.append(genre_list)
        original_language_list.append(original_language_value)
        companys_list.append(company_list)
        countrys_list.append(country_list)
        languages_list.append(language_list)
    genres_frame = creatDataFrame(genres_list, 'genres')
    languages_frame = creatDataFrame(original_language_list, 'languages')
    companys_frame = creatDataFrame(companys_list, 'companys')
    countrys_frame = creatDataFrame(countrys_list, 'countrys')
    spokens_frame = creatDataFrame(languages_list, 'spokens')
    return genres_frame,languages_frame,companys_frame,countrys_frame,spokens_frame

# main function
def main():
    # read and process the list-valued columns
    (genres_frame, languages_frame, companys_frame, countrys_frame, spokens_frame) = readData2(path)
    # newframe is the fully processed result
    newframe = pd.concat(
        [genres_frame, languages_frame, companys_frame, countrys_frame, spokens_frame, readData1(path)], axis=1) # concatenate the processed pieces
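As an aside, sklearn's `MultiLabelBinarizer` could replace the hand-rolled `changeData`/`creatDataFrame` pair for the list-valued columns; a minimal sketch on made-up genre ids:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the parsed genres column: one list of genre ids per movie
genre_ids = [[16, 35], [35], [18, 16]]

mlb = MultiLabelBinarizer()
matrix = mlb.fit_transform(genre_ids)  # rows: movies, columns: sorted unique ids
frame = pd.DataFrame(matrix, columns=[f"genres[{c}]" for c in mlb.classes_])
print(frame)
```

This yields the same 0/1 layout as the manual loop, but the vocabulary is collected and sorted in one pass.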
        

(2) Handling missing data

1. If a column has many missing values and little influence on the rating, it is dropped outright; this dataset happens to have no such column.

2. If a column has only a few missing values (under 2%), they can be filled with 0 or the column mean, or the affected rows can be dropped; here they are filled with 0.

newframe = newframe.fillna(0)   # fill missing values with 0
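The mean-fill alternative mentioned above is equally short; a minimal sketch on a toy column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'runtime': [90.0, np.nan, 110.0]})
# fill the gap with the mean of the non-missing values
df['runtime'] = df['runtime'].fillna(df['runtime'].mean())
print(df['runtime'].tolist())  # [90.0, 100.0, 110.0]
```

Mean filling keeps the column average unchanged, whereas filling with 0 drags it down when values are large.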

(3) Handling duplicate data

Duplicate rows are dropped directly; remember that whenever rows are deleted, the index must be reset.

newframe.drop_duplicates(inplace=True)  # drop duplicate rows
newframe.index = range(newframe.shape[0])  # do not forget to restore the index after deleting

(4) Handling outliers

Inspect the mean, variance, extremes, and percentiles of each column:

print(newframe.describe([0.01, .99]).T)    # distribution of each column

A runtime of 720 looks suspect, so those rows are removed for now.

A vote_average of 10 also looks suspect (a perfect score of 10 is hardly plausible), so those rows are removed as well.

newframe = newframe[newframe.loc[:, "runtime"] < 720]    # drop rows whose runtime is >= 720
newframe = newframe[newframe.loc[:, "vote_average"] < 10]   # drop rows whose vote_average is >= 10
newframe.index = range(newframe.shape[0])   # restore the index
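Instead of hard-coding cutoffs such as 720, the percentiles printed by `describe` can drive the filter; a minimal sketch on toy runtimes, where the 99% threshold is an illustrative choice rather than a recommendation:

```python
import pandas as pd

df = pd.DataFrame({'runtime': [85, 90, 95, 100, 105, 110, 720]})
upper = df['runtime'].quantile(0.99)   # 99th-percentile cutoff
df = df[df['runtime'] <= upper]        # keep only rows at or below the cutoff
df.index = range(df.shape[0])          # restore the index after dropping rows
```

This adapts automatically if the data changes, at the cost of always trimming the top 1% even when it is legitimate.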

3. Feature Selection

After preprocessing, the data has 24,126 columns, so variance filtering is applied to remove features whose own variance is very small. A low-variance feature barely differs across samples (most of its values are identical), so it contributes little to telling samples apart; removing zero-variance features therefore shrinks the data without losing useful information.
The code:

data_x = newframe.iloc[:, newframe.columns != "vote_average"]   # isolate the features
data_y = (newframe.loc[:, "vote_average"]).astype("float")  # isolate the label
from sklearn.feature_selection import VarianceThreshold  # variance filtering operates on the features
selector = VarianceThreshold()  # instantiate; with no argument the threshold defaults to variance 0
X = selector.fit_transform(data_x)  # new feature matrix with the unqualified features removed
print(X.shape)  # X is a NumPy array, so inspect its shape (it has no .columns)

Save the result to a new CSV file:

data = pd.concat([pd.DataFrame(X), pd.DataFrame(data_y)], axis=1)    # join the processed features and the label into one DataFrame
data.to_csv(r"movies_data_new.csv", sep=',', header=True, index=False)
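Note that `fit_transform` returns a NumPy array, so the original column names are lost before the CSV is written. If the names matter downstream, `get_support` recovers which columns survived the filter; a minimal sketch on a toy frame with illustrative column names:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data_x = pd.DataFrame({'constant': [1, 1, 1], 'budget': [10, 20, 30]})
selector = VarianceThreshold()                 # default threshold: variance 0
X = selector.fit_transform(data_x)
kept = data_x.columns[selector.get_support()]  # names of the surviving columns
X_frame = pd.DataFrame(X, columns=kept)
print(list(kept))  # ['budget']
```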

4. Model Scores with sklearn

The best performer is the random forest, which scores 0.72 without any parameter tuning.

(1) Regression tree (DecisionTreeRegressor)

Score: 0.23

Code:

# -*- coding:utf-8 -*-
import pandas as pd
from sklearn.model_selection import train_test_split
# decision tree model, score 0.23
def decisionTree(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn import tree
    clf = tree.DecisionTreeRegressor(
                                      # , random_state=30
                                       splitter='random' # 'best' or 'random'; 'random' adds more randomness
                                      , max_depth=10 # limit the tree depth to 10
                                      # , min_samples_leaf=10 # a leaf must contain more than 10 samples
                                      # , min_samples_split=20 # a node must contain more than 20 samples to split
                                      )  # with random_state set, the score stops fluctuating   # step 1: instantiate
    clf = clf.fit(Xtrain, Ytrain) # step 2: fit the model
    score = clf.score(Xtest, Ytest) # step 3: score the model
    return score

def main():
    df = pd.read_csv(r"movies_data_new.csv", engine='python')   # read the data
    data_x = df.iloc[:, 0:-1]    # separate the features
    data_y = df.iloc[:, -1]  # separate the label
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(data_x, data_y, test_size=0.3, random_state=30) # split train and test sets
    score = decisionTree(Xtrain, Ytrain, Xtest, Ytest)   # feed the data to the model and get its score
    print(score)

if __name__ == '__main__':
    main()

(2) Random forest (RandomForestRegressor)

Score: 0.72

Code:

# random forest, score 0.72
def randomForest(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.ensemble import RandomForestRegressor
    clf = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)
    return score
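Since 0.72 was reached without tuning, a small grid search is a natural next step; the sketch below runs on synthetic data, and the grid values are illustrative rather than recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the movie features
X, y = make_regression(n_samples=200, n_features=10, random_state=0)

param_grid = {'n_estimators': [50, 100], 'max_depth': [5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)  # best combination found by 3-fold cross-validation
```

`best_params_` can then be plugged back into `randomForest` above.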

(3) Naive Bayes (GaussianNB)

The continuous label must be discretized before this model will run.
Score: 0.10
Code:

# naive Bayes model, score 0.10
def gaussianNB(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.naive_bayes import GaussianNB
    gnb = GaussianNB().fit(Xtrain, Ytrain)
    acc_score = gnb.score(Xtest, Ytest)
    return acc_score
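Discretizing the continuous label, as the classifiers in this section require, can be done with `pd.cut`; a minimal sketch where the bin edges are an illustrative choice:

```python
import pandas as pd

vote_average = pd.Series([2.3, 5.5, 7.1, 8.8])
# Bin the scores into low / medium / high classes for classifiers such as GaussianNB
labels = pd.cut(vote_average, bins=[0, 4, 7, 10], labels=[0, 1, 2])
print(labels.tolist())  # [0, 1, 2, 2]
```

The choice of bins affects both class balance and the reported accuracy, so the scores of the classification models are not directly comparable to the R² scores of the regressors.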

(4) Logistic regression (LogisticRegression)

The continuous label must be discretized before this model will run.
Score: 0.38
Code:

# logistic regression, score 0.38
def logisticRegression(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.linear_model import LogisticRegression
    logc = LogisticRegression().fit(Xtrain, Ytrain)
    score = logc.score(Xtest, Ytest)
    return score

(5) K-nearest neighbors (KNeighborsRegressor)

Score: 0.02
Code:

# K-nearest neighbors, score 0.02
def kNeighbors(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.neighbors import KNeighborsRegressor
    kn = KNeighborsRegressor(3).fit(Xtrain, Ytrain)
    score = kn.score(Xtest, Ytest)
    return score

(6) Quadratic discriminant analysis (QuadraticDiscriminantAnalysis)

The continuous label must be discretized before this model will run.
Score: 0.11
Code:

# quadratic discriminant analysis, score 0.11
def quadraticDiscriminantAnalysis(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    clf = QuadraticDiscriminantAnalysis().fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)
    return score

(7) Linear discriminant analysis (LinearDiscriminantAnalysis)

Score: 0.37
Code:

# linear discriminant analysis, score 0.37
def linearDiscriminantAnalysis(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    clf = LinearDiscriminantAnalysis().fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)
    return score

(8) Multilayer perceptron (MLPClassifier)

Running MLPRegressor here yields a large negative score, so the classifier variant is used.
Score: 0.34
Code:

# multilayer perceptron, score 0.34
def mLP(Xtrain, Ytrain, Xtest, Ytest):
    from sklearn.neural_network import MLPClassifier
    clf = MLPClassifier(alpha=0).fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)
    return score
