MovieLens Recommendations
The supply of movies is practically endless, to the point that a person could watch a new movie every waking hour. Even so, we often find ourselves scrolling through the selection on our streaming services (Netflix, Amazon Prime Video, Hulu, etc.), hoping to find a movie that fits our interests, or at least our mood. Unlike physically browsing through a Blockbuster in the past, streaming services try to shorten that search by showing us movie recommendations as soon as we log in. To produce those recommendations, they use data science, specifically machine learning. In this article, I will explain step by step how I built my own version of a recommendation system.
The code is available in the GitHub repository. To run it, follow the instructions in the README or in the script. Quick model stats: the system was ~65% accurate at predicting near the actual rating and ~73% accurate at predicting whether a user would like or dislike a movie.
Movie Ratings Database
The first thing to cover in any data science project is the data source. Many different databases are available for movie recommendation systems. I decided to design my system using the MovieLens 25M Dataset, which is provided for free by GroupLens, a research lab at the University of Minnesota. This dataset contains 25,000,095 movie ratings from 162,541 users, on a rating scale from 0.5 to 5.0.
Though there are many files in the downloaded zip file, I will only be using movies.csv, ratings.csv, and tags.csv.
Warning!
The sheer number of ratings in this dataset should raise a red flag: 25 million rows in one CSV file are not easy on RAM during pre-processing! Depending on your system, loading the movie ratings should be fine… until you start merging these CSVs… A simple early fix can bypass this issue and will be addressed later.
Data Wrangling
A glance at the contents of the three CSV files immediately shows that movies.csv and tags.csv need a bit of string manipulation and filtering before merging can even begin. (The GitHub repository processes some additional information, but not all of the processed data was used as model input, so it is not covered in this article.)
MOVIES.CSV
Luckily, GroupLens made this dataset easy to manage with uniform formatting, so all that needed to be done was to extract the genres and place them into individual columns. But first, I needed to determine which unique genres the dataset uses.
The code extracts all the genres in movies.csv into a list, converts the list to a set to keep only the unique genres, and then converts it back to a list so that list methods can be used later if necessary. Comparing the unique genres obtained from the dataset with the genres listed in the README shows that the “IMAX” genre is missing from the README list.
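The list-to-set-to-list step described above can be sketched as follows. This is a minimal illustration using only the standard library, not the code from the repository: the in-memory sample stands in for movies.csv, the movie titles are made up, and the final sort is an extra convenience that the article does not mention.

```python
import csv
import io

# Tiny stand-in for movies.csv (same columns: movieId, title, genres).
# The rows are invented for illustration.
SAMPLE_MOVIES_CSV = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Heat (1995),Action|Crime|Thriller
3,Some IMAX Film (2010),Action|IMAX
4,Untitled (2001),(no genres listed)
"""

def unique_genres(csv_file):
    """Collect every genre string, de-duplicate with a set, return a list."""
    all_genres = []
    for row in csv.DictReader(csv_file):
        # Genres are pipe-separated inside a single column.
        all_genres.extend(row["genres"].split("|"))
    # set() drops duplicates; sorted() turns the set back into a list.
    return sorted(set(all_genres))

genres = unique_genres(io.StringIO(SAMPLE_MOVIES_CSV))
print(genres)
```

With the real file, the same function could be called on `open("movies.csv", newline="", encoding="utf-8")` instead of the `StringIO` sample.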
The IMAX genre may be missing from the README for a simple reason: most likely, IMAX wasn’t part of the dataset in previous versions and the people at GroupLens simply forgot to update the README file. The reason I bring this up, though, is that IMAX isn’t a genre… it’s a viewing feature. So, I decided to remove it from my list of unique genres before making the individual genre columns. I also renamed the genre “(no genres listed)” to “None” to make things a bit easier and more uniform to remember when coding.
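As a rough sketch of that clean-up, the snippet below drops IMAX, renames the placeholder, and builds one 0/1 flag per genre column. The shortened genre list and the helper name `genre_flags` are hypothetical, chosen only to keep the example small.

```python
# Hypothetical, shortened list of unique genres found in the dataset.
raw_genres = ["Action", "Comedy", "Drama", "IMAX", "(no genres listed)"]

PLACEHOLDER = "(no genres listed)"

# Drop IMAX (a viewing feature, not a genre) and rename the placeholder.
genre_columns = [
    "None" if g == PLACEHOLDER else g
    for g in raw_genres
    if g != "IMAX"
]

def genre_flags(genres_field):
    """Turn a pipe-separated genres string into one 0/1 flag per genre column."""
    present = {
        "None" if g == PLACEHOLDER else g
        for g in genres_field.split("|")
    }
    # IMAX is not in genre_columns, so it is silently ignored here.
    return {col: int(col in present) for col in genre_columns}

print(genre_flags("Action|Comedy|IMAX"))
```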
TAGS.CSV
Now, this is where things become a bit tricky, because the tags in this dataset were user-created and some of them are far more opinionated than others. But before deciding which opinions to keep, I lowercased all tags and removed any parenthesized text from them, such as the “(Best Supporting Actress)” in “Oscar (Best Supporting Actress)”.
To remove the parentheses and the text they enclose, I used Python’s regular expressions module (“import re”). Afterward, as the simplest possible form of Natural Language Processing (NLP) without any additional libraries, I gave the tags a brief look and decided on a rule for which tags were too opinionated.
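For readers who have not used `re` before, here is one way the lowercasing and parenthesis removal could look. The exact pattern used in the repository is not shown in the article, so the pattern and the function name `clean_tag` below are assumptions.

```python
import re

# Matches an opening parenthesis, everything up to the next closing one,
# and any whitespace directly in front of it.
PAREN_PATTERN = re.compile(r"\s*\([^)]*\)")

def clean_tag(tag):
    """Lowercase a tag and strip any parenthesized text from it."""
    without_parens = PAREN_PATTERN.sub("", tag)
    return without_parens.lower().strip()

print(clean_tag("Oscar (Best Supporting Actress)"))  # oscar
```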
Just from the first five rows of tags.csv, I would say that “so bad it’s good” is not a good description of a movie, and there are many tags similar to that phrase. Thankfully, what these opinionated tags have in common is words of one or two letters, such as “so”, “if”, and “a”. By removing all tags that contain such short words, I expected to filter out many tags that were not very helpful in describing the movie. However, the code needs three separate if-statements to do this one task. That is because tags containing “based” or “sci-fi” are useful but also contain a one- or two-letter word: the “on” in “based on” and the “fi” in “sci-fi” would have caused the last if-statement to remove them.
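The three if-statements could be arranged roughly like this. It is a sketch of the rule as described, not the repository code: splitting on both whitespace and hyphens is an assumption (it is what would make the “fi” in “sci-fi” count as a short word), and `keep_tag` is a made-up name.

```python
import re

def keep_tag(tag):
    """Return True if the (already lowercased) tag should be kept."""
    # Exception 1: "based on ..." tags are descriptive despite the short "on".
    if "based" in tag:
        return True
    # Exception 2: "sci-fi" would otherwise fail because of the two-letter "fi".
    if "sci-fi" in tag:
        return True
    # General rule: drop the tag if any of its words has one or two letters.
    words = [w for w in re.split(r"[\s-]+", tag) if w]
    if any(len(word) <= 2 for word in words):
        return False
    return True

for tag in ["so bad it's good", "based on a book", "sci-fi", "time travel"]:
    print(tag, "->", keep_tag(tag))
```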
[As a side note, most of the tags seemed to be spelled correctly.]
RATINGS.CSV: Defining Like and Dislike
To clearly define which movies a user liked or disliked, I defined “like” to mean that the user gave the movie a rating of 4.0 or higher; anything lower was a dislike. Since this distinction is simple to code and ratings.csv was already large, I did not add any new columns to ratings.csv to mark likes and dislikes.
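Because the rule is a single comparison, it can be applied on the fly wherever it is needed instead of being stored as a column. A minimal sketch, with invented rating rows and a hypothetical helper name:

```python
LIKE_THRESHOLD = 4.0  # ratings at or above this count as a "like"

def is_like(rating, threshold=LIKE_THRESHOLD):
    """Apply the like/dislike definition to a single rating."""
    return rating >= threshold

# Invented (userId, movieId, rating) rows in the shape of ratings.csv.
ratings = [(1, 10, 4.5), (1, 11, 3.5), (1, 12, 4.0), (2, 10, 0.5)]

liked = [(user, movie) for user, movie, rating in ratings if is_like(rating)]
disliked = [(user, movie) for user, movie, rating in ratings if not is_like(rating)]
print(liked, disliked)
```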
This is the point where it would be best to create and use a subset of ratings.csv if the system does not have a lot of RAM (<16 GB).
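The article does not say how the subset should be drawn. One reasonable option, sketched below under that assumption, is to sample a fixed number of users and keep all of their ratings so that each user profile stays complete. The sample rows and the function name are illustrative only.

```python
import csv
import io
import random

# Invented stand-in for ratings.csv (userId, movieId, rating, timestamp).
SAMPLE_RATINGS_CSV = """userId,movieId,rating,timestamp
1,10,4.5,1147880044
1,11,3.0,1147868817
2,10,2.5,1147868828
2,12,5.0,1147878820
3,11,4.0,1147868510
3,13,1.5,1147879797
"""

def subset_by_users(csv_file, n_users, seed=0):
    """Keep every rating of n_users randomly chosen users."""
    # Pass 1: collect the distinct user ids (small compared with the rows).
    user_ids = sorted({row["userId"] for row in csv.DictReader(csv_file)})
    kept = set(random.Random(seed).sample(user_ids, n_users))
    # Pass 2: rewind and keep only the rows of the sampled users.
    csv_file.seek(0)
    return [row for row in csv.DictReader(csv_file) if row["userId"] in kept]

subset = subset_by_users(io.StringIO(SAMPLE_RATINGS_CSV), n_users=2)
print(len(subset), "rows kept")
```

Reading the file in two passes means only the user ids and the kept rows are held in memory, never the full 25 million ratings.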
Feature Engineering: Who Liked What and Disliked What?
Here is where things might become more complicated: how to tell the machine learning models what a user likes and dislikes. I took different approaches for the genres and for the tags of the movies that users had previously watched. I decided to use three models to generate a final prediction: a genres model, a tags model, and a combined model.
Genres Model: Scaling Genres Interests
To create the inputs for the genres model, I built a genres profile for each user: the genre counts of all the movies they liked were scaled to a 0–1 range so that the scaled values across all genres add up to 1. I did the same for the movies they disliked. For the movie in question, the genres had already been processed into numeric-categorical form in individual genre columns, so the movie profile was already usable as model input.
The main reason for scaling was to minimize any bias the models might develop toward users who have rated many movies over users who have rated only a few. For example, a straightforward approach would be to add up all the genres from the movies the user liked and feed those counts directly into the model. If Jessica rated 20 movies and Todd rated only 2, Jessica’s profile might have a value of 20 in the Action genre while Todd’s has 2. The model would most likely try to read meaning into the size difference between 20 and 2, concluding that Jessica likes Action movies a lot and Todd only somewhat likes them. With the scaling approach that I used, however, Jessica and Todd would both have high values for the Action genre and be judged on the same scale. A caveat to my scaling is that Todd would appear to REALLY(!) like Action movies while Jessica merely likes them, since Jessica’s profile will be spread across many more genres.
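The Jessica and Todd example can be worked through with a small sketch. The three-genre list and the exact split of Jessica's 20 liked movies are invented for illustration; the scaling itself is simply each genre count divided by the total, so the profile sums to 1.

```python
GENRES = ["Action", "Comedy", "Drama"]  # shortened, hypothetical genre list

def genre_profile(movies):
    """Sum genre occurrences over a set of movies, then scale them to sum to 1."""
    totals = {genre: 0 for genre in GENRES}
    for movie_genres in movies:
        for genre in movie_genres:
            totals[genre] += 1
    grand_total = sum(totals.values())
    if grand_total == 0:
        # A user with no liked (or disliked) movies gets an all-zero profile.
        return {genre: 0.0 for genre in GENRES}
    return {genre: count / grand_total for genre, count in totals.items()}

# Todd liked 2 movies, both Action.
todd = genre_profile([["Action"], ["Action"]])
# Jessica liked 20 movies: 10 Action, 6 Comedy, 4 Drama.
jessica = genre_profile([["Action"]] * 10 + [["Comedy"]] * 6 + [["Drama"]] * 4)

print("Todd:", todd)
print("Jessica:", jessica)
```

Todd's Action value comes out as 1.0 and Jessica's as 0.5, which shows both sides of the trade-off: the raw 2-versus-20 gap disappears, but Todd now looks like the bigger Action fan.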
Tags Model: Phrase Vectorization
Machine learning models require words to be represented numerically. Most NLP libraries have tools for word vectorization, but many of them do not vectorize phrases or sentences. At the start of the project, I did not think I would use the tags for one reason: could the models find relationships between the vectors when the vectors were not organized in any specific way?
To help the models learn, I did a bit more pre-processing on tags.csv. I wanted to remove all uncommon tags to shrink the vector dictionary that would be created later. With the amazing Pandas GroupBy function, finding the common tags was a simple task.
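A GroupBy-based filter for common tags might look like the sketch below. The sample rows are invented, and the cut-off `MIN_COUNT` is a hypothetical value, since the article does not state how many uses made a tag “common”.

```python
import pandas as pd

# Invented rows in the shape of tags.csv (timestamp column omitted).
tags = pd.DataFrame({
    "userId": [1, 2, 3, 1, 2, 3],
    "movieId": [10, 10, 11, 12, 12, 13],
    "tag": ["sci-fi", "sci-fi", "sci-fi", "time travel", "time travel", "boring"],
})

MIN_COUNT = 2  # hypothetical cut-off for a tag to count as common

# Count how often each tag appears, then keep the frequent ones.
tag_counts = tags.groupby("tag").size()
common_tags = tag_counts[tag_counts >= MIN_COUNT].index.tolist()

# Filter the original rows down to the common tags only.
tags_common = tags[tags["tag"].isin(common_tags)]
print(common_tags)
```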
Then, I simply vectorized the common tags. When looping through the tags, any tag that was not in the vector dictionary was skipped because it was considered uncommon. To create the tags profile for each user, I added up all the tags for the movies the user liked and, separately, for the movies the user disliked, and kept only the 20 most frequent tags in each. This allows for more general and uniform profiles across all users rather than focusing only on the tags that the user created. After [a long time of] processing, the results are DataFrames holding the user and movie tag profiles.
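To make the profile-building step concrete, here is a minimal sketch. It assumes the “vector dictionary” maps each common tag to an integer id and that short profiles are padded with 0; neither detail is confirmed by the article, and the names are made up.

```python
from collections import Counter

TOP_N = 20  # number of tag slots kept per profile

# Hypothetical vector dictionary: each common tag gets an integer id.
# Id 0 is reserved for padding (an assumption, not from the article).
common_tags = ["sci-fi", "time travel", "dystopia"]
tag_to_id = {tag: i for i, tag in enumerate(common_tags, start=1)}

def tag_profile(movie_tag_lists, top_n=TOP_N):
    """Count tag ids over a set of movies and keep the top_n most frequent."""
    counts = Counter()
    for movie_tags in movie_tag_lists:
        for tag in movie_tags:
            if tag in tag_to_id:  # uncommon tags are skipped
                counts[tag_to_id[tag]] += 1
    # most_common() returns ids in descending order of frequency.
    top = [tag_id for tag_id, _ in counts.most_common(top_n)]
    return top + [0] * (top_n - len(top))

liked_movie_tags = [
    ["sci-fi", "dystopia"],
    ["sci-fi", "so bad it's good"],
    ["sci-fi", "time travel", "dystopia"],
]
profile = tag_profile(liked_movie_tags)
print(profile)
```

Ordering the slots by frequency gives every user a profile of the same length in which the first positions hold the tags that matter most.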
Model Training
Now that the inputs were created for two of the machine learning models, the models needed to be created and trained.
Genres Model: Neural Network/Deep Learning
For the neural network modeling, I used Keras/TensorFlow (GPU).
As explained in the code comments, this model takes the user’s liked and disliked genre profiles and the genre profile of the movie in question as three separate inputs. Then, a concatenation layer combines the three branches. The output layer uses the sigmoid activation function because I wanted the predictions to be capped at 5 (the maximum on the rating scale). Since the sigmoid outputs values between 0 and 1, the labels/ratings need to be scaled down before training (divided by 5) and the predictions scaled back up afterward (multiplied by 5).
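The three-input architecture described above can be sketched with the Keras functional API. Only the overall shape (three inputs, a concatenation, a sigmoid output, labels divided by 5) comes from the article; the number of genre columns, the layer sizes, the optimizer, and the loss are placeholder assumptions.

```python
import numpy as np
from tensorflow import keras

layers = keras.layers

N_GENRES = 19      # hypothetical number of genre columns
MAX_RATING = 5.0   # top of the MovieLens rating scale

def build_genres_model(n_genres=N_GENRES):
    """Three genre-profile inputs, concatenated, with a sigmoid output."""
    liked_in = keras.Input(shape=(n_genres,), name="liked_profile")
    disliked_in = keras.Input(shape=(n_genres,), name="disliked_profile")
    movie_in = keras.Input(shape=(n_genres,), name="movie_genres")

    # One small dense branch per input (sizes are placeholders).
    branches = [
        layers.Dense(16, activation="relu")(x)
        for x in (liked_in, disliked_in, movie_in)
    ]
    merged = layers.Concatenate()(branches)
    hidden = layers.Dense(16, activation="relu")(merged)
    # Sigmoid keeps the output between 0 and 1.
    output = layers.Dense(1, activation="sigmoid")(hidden)

    model = keras.Model(inputs=[liked_in, disliked_in, movie_in], outputs=output)
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_genres_model()

# Labels are divided by 5 before training ...
ratings = np.array([4.5, 2.0, 5.0])
scaled_labels = ratings / MAX_RATING

# ... and predictions are multiplied by 5 afterward (untrained model here).
rng = np.random.default_rng(0)
batch = [rng.random((3, N_GENRES)) for _ in range(3)]
predicted_ratings = model.predict(batch, verbose=0) * MAX_RATING
print(predicted_ratings.ravel())
```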
Tags Model: Random Forest
For the tags model, I decided to use a random forest since the input variables were ordered by descending popularity; the random forest can therefore determine the importance of each variable.
WARNING!
Normally, hyperparameter optimization would be required. However, each tree with default parameters took a large amount of RAM. On my system with 48 GB of RAM, I was only able to max out at 100 trees, with occasional shutdowns due to Python running out of RAM. For systems with less RAM, this is when limiting the depth of each tree becomes necessary. From my testing, using less data during the training phase does not greatly affect the prediction results.
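With scikit-learn, tree size can be capped through the `max_depth` parameter of `RandomForestRegressor`. The sketch below assumes the article's forest was a scikit-learn regressor, which the text does not confirm; the depth limit of 10, the feature width of 60, and the synthetic data are all placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for tag-profile features and rating labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(200, 60))   # e.g. liked + disliked + movie tag ids
y = rng.uniform(0.5, 5.0, size=200)

forest = RandomForestRegressor(
    n_estimators=100,  # the tree count mentioned in the article
    max_depth=10,      # hypothetical cap that keeps each tree small in memory
    random_state=0,
)
forest.fit(X, y)

deepest = max(tree.get_depth() for tree in forest.estimators_)
print("deepest tree:", deepest)
```

The default `max_depth=None` lets every tree grow until its leaves are pure, which is what makes memory use balloon on millions of rows.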
Combined Model: Linear Regression
The last model takes the predictions from both the genres and tags models and outputs a final prediction using linear regression.
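As an illustration of this stacking step, the sketch below fits an ordinary least-squares line on top of two sets of predictions with NumPy. The numbers are invented, and the use of `np.linalg.lstsq` with an intercept column is an assumption about how the regression could be done, not a description of the repository code.

```python
import numpy as np

# Invented predictions from the two base models and the true ratings.
genres_pred = np.array([3.0, 4.0, 2.0, 5.0])
tags_pred = np.array([3.5, 4.5, 1.5, 4.0])
actual = np.array([3.5, 4.0, 2.0, 4.5])

def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return float(np.mean((y_true - y_pred) ** 2))

# Design matrix: one column per base model plus an intercept column.
X = np.column_stack([genres_pred, tags_pred, np.ones_like(genres_pred)])
coef, *_ = np.linalg.lstsq(X, actual, rcond=None)
combined_pred = X @ coef

print("weights (genres, tags, intercept):", coef)
print("MSE genres / tags / combined:",
      mse(actual, genres_pred), mse(actual, tags_pred), mse(actual, combined_pred))
```

On the data it was fitted to, the combined prediction can never have a higher squared error than either input, because using one base model unchanged is just a special case of the weights.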
Results: Statistical Analysis of Models
Now for the important part of this whole article, the results:
Genres Model:
Tags Model:
Combined Model:
First, I should explain the custom statistical term “FLEX” used in the results above. FLEX for ratings means that a regression prediction needs to be within +/-0.5 of the actual label to be considered correct. For predicting like or dislike, the FLEX decision boundary for a liked movie is lowered to 3.5+ instead of 4.0+. The reasoning for making the statistics flexible is that the original rating scale uses 0.5 increments, but regression models will by default predict any value on or between those increments. And, to me, a prediction close to the actual rating should still count, since predicting the exact rating versus a close one should not make much of a difference when recommending movies.
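The two FLEX statistics could be computed along these lines. One detail is an interpretation rather than something the article spells out: here the lowered 3.5 boundary is applied to the predictions, while the actual ratings keep the 4.0 definition of a like. The sample values are invented.

```python
def flex_rating_accuracy(actual, predicted, tolerance=0.5):
    """Share of predictions within +/- tolerance of the actual rating."""
    hits = sum(abs(a - p) <= tolerance for a, p in zip(actual, predicted))
    return hits / len(actual)

def flex_like_accuracy(actual, predicted, like_threshold=4.0, flex_threshold=3.5):
    """Share of correct like/dislike calls with the lowered prediction boundary.

    Assumption: actual likes use like_threshold, predicted likes use
    flex_threshold.
    """
    hits = sum(
        (a >= like_threshold) == (p >= flex_threshold)
        for a, p in zip(actual, predicted)
    )
    return hits / len(actual)

actual = [4.0, 3.0, 5.0, 2.5]
predicted = [3.6, 3.4, 4.2, 3.7]

print(flex_rating_accuracy(actual, predicted))  # 0.5
print(flex_like_accuracy(actual, predicted))    # 0.75
```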
Comparing all three models, the statistics do show that performance improves in the combined model, but not by as much as I had hoped. Still, an improvement is an improvement, especially given the training and prediction speed of a small linear regression model.
[Although I used regression models, I can still infer like/dislike using the definition of like (a rating of 4.0+). Switching to classification models would yield slightly better results in predicting like/dislike, but it would not be useful for deciding which movies to recommend first, since categorical predictions carry no values to rank by.]
Top 10 Movie Recommendations: User 6550
Taking user 6550 as an example, the combined model recommended various animation, drama, and war movies, matching 6550’s like and dislike genre profiles. Oddly, the artwork of the four animated movies shares a similar Japanese anime style. Perhaps they were tied together by the tags? If so, the tags model might have been able to make the connections between the tag vectors that I feared it would not be able to make.
[Since the tag profile is not as intuitive as the genre profile without transforming the tag vectors back into the original tags, it is not shown here. User 6550 was chosen randomly and coincidentally was one of the users who had made many ratings.]
Conclusion
After seeing the results, I would say that my recommendation system worked well enough, with a ~73% chance of correctly predicting whether a person would like or dislike a movie.
To improve the predictive performance, I would first examine which users the models have trouble predicting correctly and see whether those users have anything in common. One possibility is that the models have a low chance of predicting correctly for users who rated only a few movies. If that is the case, collaborative filtering could help by having such users borrow from similar users who have rated a lot of movies.
Lastly, how would a streaming service use this project? First, it can use the system for its intended purpose of recommending movies to its customers. Second, it can use it to decide which movies to add to and remove from its selection. Third, it would help the service understand current trends, the interests of its customers, and, if the service produces movies, which genres to focus its production on.
[I encourage readers to share their thoughts and experiences with recommendation systems! I’m still learning so any input would be helpful.]
Translated from: https://medium.com/swlh/recommendation-system-for-movies-movielens-grouplens-171d30be334e