Movie Recommendation System Based on MovieLens

This summer I had the privilege of taking part in the Made With ML incubator, a meaningful hands-on experience in data science. I chose the MovieLens dataset and built a movie recommendation system that loosely imitates some of the most successful recommendation engines, such as those behind TikTok, YouTube, and Netflix.

This article explains how I worked through the entire life cycle of the project and presents my solutions to some technical issues.

Ideas

At first glance, the dataset consists of three tables:

  • movies.csv: The table that contains all the information about the movies, including title, tagline, description, etc. It has 21 features/columns in total, so we can either focus on a few of them or try to use them all.

  • ratings_small.csv: A table that records every user's rating behavior, including the ratings themselves and the timestamps at which they were posted.

  • links.csv: A table that records each movie's unique ID in two movie databases, IMDb and TMDB.

There are two common recommendation filtering techniques: collaborative filtering and content filtering. Collaborative filtering requires the model to learn the connections/similarities between users so that it can generate the best recommendations based on users' previous choices, preferences, or tastes. Content filtering needs profiles of both the users and the items so that the system can make recommendations according to the properties that users and items have in common.

Now I am going to try both of them step by step.

Collaborative Filtering

Collaborative filtering only requires me to keep track of users' previous behavior, say, how much they liked a movie in the past. Fortunately, we already have this sort of information, because the data in ratings_small.csv reflects exactly this. To implement the technique, I used the Python library Surprise, which provides a set of built-in algorithms that are commonly used in recommendation system development. I chose 5 of them and compared their accuracy using RMSE as the measure. The result is as follows:

Figure: Accuracy of algorithms for recommendation system (Image by Author)
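
To make the comparison criterion concrete, here is a minimal, dependency-free sketch of how RMSE can rank predictors on held-out ratings. The tiny rating lists and the two baseline predictors (global mean and per-user mean) are made up for illustration; they are not the five Surprise algorithms compared above.

```python
import math
from collections import defaultdict

# Hypothetical (user_id, movie_id, rating) triples, not taken from ratings_small.csv.
train = [
    (1, 10, 5.0), (1, 11, 4.0), (1, 12, 5.0),
    (2, 10, 2.0), (2, 11, 1.0), (2, 13, 2.0),
]
test = [(1, 13, 5.0), (2, 12, 1.0)]


def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    pairs = list(pairs)
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))


global_mean = sum(r for _, _, r in train) / len(train)

ratings_by_user = defaultdict(list)
for user, _, rating in train:
    ratings_by_user[user].append(rating)
user_mean = {u: sum(rs) / len(rs) for u, rs in ratings_by_user.items()}


def predict_global(user, movie):
    # Baseline 1: every prediction is the global mean rating.
    return global_mean


def predict_user_mean(user, movie):
    # Baseline 2: the user's own mean, falling back to the global mean
    # for users that were not seen in training.
    return user_mean.get(user, global_mean)


predictors = [("global_mean", predict_global), ("user_mean", predict_user_mean)]
scores = {
    name: rmse((predict(u, m), r) for u, m, r in test)
    for name, predict in predictors
}
best = min(scores, key=scores.get)  # lower RMSE is better
print(scores, best)
```

The same pattern applies to any set of algorithms: fit each one on the training split, predict the held-out ratings, and keep the one with the lowest RMSE.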

SVD outperformed all the other algorithms. Here is a snippet of what the final recommendation list (generated with SVD, of course) looks like for each user:

Figure (Image by Author)

The most obvious advantage of collaborative filtering is that it is easy to implement. It does not require very detailed information about the users and items, and ideally it can be done in 5 lines of code.
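
For readers curious about what happens underneath, below is a rough, dependency-free sketch of the kind of model behind an SVD-style recommender: biased matrix factorization trained with stochastic gradient descent. It is not Surprise's implementation and not the code used in this project; the ratings, hyperparameters, and function names are assumptions chosen only to keep the example small.

```python
import random

# Hypothetical (user_id, movie_id, rating) triples for illustration only.
ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 4.5), ("u1", "m3", 1.0),
    ("u2", "m1", 4.5), ("u2", "m2", 5.0), ("u2", "m4", 1.5),
    ("u3", "m3", 5.0), ("u3", "m4", 4.5), ("u3", "m1", 1.0),
]


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.02, reg=0.02, seed=0):
    """Fit rating ~ mu + b_u + b_i + p_u . q_i with plain SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    # Small random latent factors for every user and every movie.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return mu + bu[u] + bi[i] + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Regularized gradient steps on biases and latent factors.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict, items


def top_n(predict, items, seen, user, n=2):
    """Rank the movies the user has not rated yet by predicted rating."""
    candidates = [i for i in items if i not in seen]
    return sorted(candidates, key=lambda i: predict(user, i), reverse=True)[:n]


predict, items = train_mf(ratings)
seen_by_u1 = {i for u, i, _ in ratings if u == "u1"}
recommendations = top_n(predict, items, seen_by_u1, "u1")
train_rmse = (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5
print(recommendations, round(train_rmse, 3))
```

With a library such as Surprise, the training loop above is hidden behind a fit call, which is why the technique takes so little code in practice.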

Content Filtering

Collaborative filtering has an outstanding advantage, but the other side of the coin is just as apparent: it cannot resolve the “cold start” problem. This refers to the situation where a new item or a new user is added to the system and the system has no way to either promote the item to consumers or suggest any options to the user. The reason is that the system does not keep track of the properties of users and items. A new item will not be promoted until users start rating it; likewise, the system has no idea what to recommend to a new user until that user starts to rate.

Content filtering is the solution to this. It enables the system to understand users' preferences when user/item profiles are provided. For example, if a user's playlist contains Justice League, Avengers, Aquaman, and The Shining, chances are that he/she prefers the action and horror genres. With collaborative filtering, this user might be offered some comedies simply because other viewers who watched Justice League, Avengers, Aquaman, and The Shining also watched comedies. That makes little sense if this particular user does not like comedies at all. With content filtering, such an issue can be avoided because the system already knows what this user prefers.

To implement a content-filtering recommendation system, I used TF-IDF to reflect the importance of each genre in any movie (I only considered genres at this stage). I then calculated the sum-product of these importance weights and the user's preferences for the different genres (given in the user profile). Based on the sum-product, we can simply sort the movies and suggest the top N candidates to the user as recommendations.
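
The scoring described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions of my own: the movie-to-genre mapping and the user profile are invented, the TF-IDF variant (term frequency 1/number of genres, inverse document frequency log(N/df)) is one of several possible choices, and the function names are hypothetical.

```python
import math

# Hypothetical movie -> genres mapping; the real data would come from movies.csv.
movie_genres = {
    "Movie A": ["Action", "Adventure"],
    "Movie B": ["Horror"],
    "Movie C": ["Comedy", "Romance"],
    "Movie D": ["Action", "Horror"],
}
# Hypothetical user profile: preference weight per genre.
user_profile = {"Action": 0.6, "Horror": 0.4}


def tfidf_weights(movie_genres):
    """TF-IDF weight of each genre within each movie.

    TF is 1 / (number of genres of the movie); IDF is log(N / document frequency).
    """
    n_movies = len(movie_genres)
    doc_freq = {}
    for genres in movie_genres.values():
        for g in set(genres):
            doc_freq[g] = doc_freq.get(g, 0) + 1
    weights = {}
    for movie, genres in movie_genres.items():
        tf = 1.0 / len(genres) if genres else 0.0
        weights[movie] = {g: tf * math.log(n_movies / doc_freq[g]) for g in genres}
    return weights


def recommend(user_profile, weights, n=2):
    """Score each movie by the sum-product of genre weights and user preferences."""
    scores = {
        movie: sum(w * user_profile.get(g, 0.0) for g, w in genre_w.items())
        for movie, genre_w in weights.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n], scores


weights = tfidf_weights(movie_genres)
top_movies, scores = recommend(user_profile, weights)
print(top_movies)
```

Rare genres receive a higher weight than common ones, so a movie that is purely of a genre the user likes tends to outrank a movie where that genre is diluted among several others.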

What if I’m new?

As described above, I created the user/movie profiles from the existing users' historical rating records. That alone does not entirely solve the cold start problem, because the system still has no idea what to do for new users or with new movies. Later in this article I will explain how I extract genre information from movie posters; first, I am going to show how the system responds to a new user.

I assume that new users come with one of two mindsets: they either know what kinds of movies they want or have no idea. For the first type of user, I let them choose whichever genres they like, and the system simply returns recommendations based on these self-provided preferences.

For those who do not know what they want yet, I implemented part of the work of Tobias Dörsch, Andreas Lommatzsch, and Christian Rakow. As soon as a new user without any preferences makes a request, the system scrapes the most popular Twitter accounts that focus on movies. I then match the most frequently mentioned named entities, as recognized by spaCy, against the movies. The matched movies are assumed to be the ones most likely to be popular, because they are closely related to the people and movies being talked about right now.

Figure: Tobias Dörsch, Andreas Lommatzsch, and Christian Rakow’s Topical Video-On-Demand Recommendations based on Event Detection (Source: http://dl.icdst.org/pdfs/files/1cd028f7a702b291a00984c192f687db.pdf)
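
The matching step can be sketched roughly as follows. The sketch assumes the named entities have already been extracted from the scraped tweets (the NER step itself is left out), and the entity list, the movie metadata, and the function name are all hypothetical stand-ins rather than the project's actual data or code.

```python
from collections import Counter

# Hypothetical named entities, assumed to be already extracted from scraped
# tweets by an NER step (the article uses spaCy for this).
entities = [
    "Christopher Nolan", "Tenet", "Christopher Nolan",
    "Robert Pattinson", "Tenet", "Christopher Nolan",
]

# Hypothetical movie metadata: title plus the people involved.
movies = [
    {"title": "Inception", "people": {"Christopher Nolan", "Leonardo DiCaprio"}},
    {"title": "Twilight", "people": {"Robert Pattinson", "Kristen Stewart"}},
    {"title": "Toy Story", "people": {"Tom Hanks", "Tim Allen"}},
]


def trending_movies(entities, movies, top_k_entities=2):
    """Return movies linked to the most frequently mentioned entities."""
    counts = Counter(entities)
    hot = dict(counts.most_common(top_k_entities))
    scored = []
    for movie in movies:
        names = movie["people"] | {movie["title"]}
        # A movie's score is the total mention count of its hot entities.
        score = sum(count for name, count in hot.items() if name in names)
        if score > 0:
            scored.append((score, movie["title"]))
    # Highest score first; ties broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


print(trending_movies(entities, movies))
```

Exact string matching is the simplest possible choice here; real tweets would likely need normalization (case, nicknames, partial names) before the lookup.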

How to release new movies?

A well-established movie streaming platform introduces new movies constantly. I wanted to simulate this behavior, and my idea was that whenever new movies start streaming, they can be recommended by the content filtering system even if their production companies do not provide genre information. I am now going to introduce a method that applies computer vision (CV) to generate the genres automatically.

Figure: Thousands of movies have no genre labels yet; I treated them as newly released movies that have not been labeled. (Image by Author)

Genre labeling in my case is a multi-label classification problem, because a movie is likely to be labeled with more than one genre. My solution is to apply transfer learning with a pre-trained MobileNet model. The important part is choosing the threshold that determines whether a movie is considered to belong to a certain genre. In this repo, Ashref Maiza explains that his custom macro soft-F1 is a better choice of loss function than the built-in binary cross-entropy, because the model learns to be less “hesitant” and, consequently, the performance of the system does not change much when the threshold is varied within the middle range. My final automatic classification therefore looks like this:

Figure (Image by Author)
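
To show the arithmetic behind the loss and the threshold, here is a plain-Python sketch of a common formulation of macro soft-F1, followed by thresholding. It is not the code from the repo mentioned above; in practice the loss would be written with the tensor operations of a deep learning framework so that it can be differentiated. The genre names, probabilities, and function names are made up for illustration.

```python
def macro_soft_f1_loss(y_true, y_prob, eps=1e-16):
    """Macro soft-F1 loss for multi-label classification.

    y_true: list of samples, each a list of 0/1 labels.
    y_prob: list of samples, each a list of predicted probabilities.
    True/false positives and false negatives are accumulated from the
    probabilities themselves instead of from thresholded decisions.
    """
    n_labels = len(y_true[0])
    costs = []
    for j in range(n_labels):
        tp = sum(p[j] * t[j] for t, p in zip(y_true, y_prob))
        fp = sum(p[j] * (1 - t[j]) for t, p in zip(y_true, y_prob))
        fn = sum((1 - p[j]) * t[j] for t, p in zip(y_true, y_prob))
        soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
        costs.append(1 - soft_f1)  # minimizing the cost maximizes soft-F1
    return sum(costs) / n_labels   # macro average over labels


def predict_genres(probs, genres, threshold=0.5):
    """Keep every genre whose predicted probability reaches the threshold."""
    return [g for g, p in zip(genres, probs) if p >= threshold]


genres = ["Action", "Comedy", "Horror"]
y_true = [[1, 0, 1], [0, 1, 0]]
confident = [[0.9, 0.1, 0.8], [0.1, 0.9, 0.2]]
hesitant = [[0.6, 0.4, 0.55], [0.4, 0.6, 0.45]]

print(macro_soft_f1_loss(y_true, confident), macro_soft_f1_loss(y_true, hesitant))
print(predict_genres(confident[0], genres, 0.3), predict_genres(confident[0], genres, 0.7))
print(predict_genres(hesitant[0], genres, 0.3), predict_genres(hesitant[0], genres, 0.7))
```

The loss rewards probabilities that sit close to 0 or 1, and predictions of that kind keep the same genre set across a wide range of thresholds, whereas hesitant probabilities near 0.5 flip as soon as the threshold moves.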

Deployment

I wrapped up what I explored in the previous sections in a web application built with Streamlit. Feel free to have fun with it at https://recommendation-sys.herokuapp.com/.

Figure: Recommendation System Screenshot (Image by Author)

Conclusion

This is my first attempt at simulating state-of-the-art recommendation engines. I drew on my knowledge of NLP and CV, especially content/collaborative filtering recommendation and multi-label classification.

I should admit that this project still has plenty of room for improvement. Here are some of the things I plan to focus on in the future:

  1. Utilize more information from the given dataset. You may remember that I mentioned there are 21 columns in movies.csv, but I only used genres to keep the demonstration simple. With more information as input, the recommendations should become more personalized and targeted.

  2. Use more advanced recommendation techniques. The recommendation engines developed by big-name brands are becoming more and more sophisticated, and the logic behind them is becoming more and more in-depth. Model-based filtering and hybrid filtering are some of the more recent techniques.

  3. Eliminate the “filter bubble”. A user may end up watching only the movies recommended by the system, and those recommendations are based on his/her previous watch history. In that case, movies that do not align with the user's preferences never surface, which leaves the user trapped in a “bubble”. Some users are nonetheless open to other genres, so the recommendation system should strike a balance between recommending similar movies and different ones.

  4. Recommend movies based on recent events. This sounds quite similar to what I did for new users who do not provide their preferences. The difference is that I simply recommended movies involving people who are currently popular (actors, filmmakers, etc.), whereas the recommendation system could be made smarter: it could understand what is happening right now, even if it is not movie-related, and recommend related movies. This would be an interesting direction, given that millions, if not billions, of people are stuck at home because of COVID-19 these days and might want to watch how humans eventually conquer a virus outbreak in a movie.

If you are interested in my project and willing to contribute to it, please feel free to visit here:

Translated from: https://towardsdatascience.com/movie-recommendation-system-based-on-movielens-ef0df580cd0e
