Recommender Systems, Value Alignment, Reinforcement Learning, and Ethics
This story appeared in my weekly writing on making automation and AI accessible and fair, at Democratizing Automation.
This encompasses a brief intro to online recommendation systems, the ethics of such systems, ways we can formulate human reward, and a view of recommendation to humans as a slowly updated batch reinforcement learning problem (the game, where our reward is not even in the loop).
A brief overview of recommender systems
The news feed systems, so to speak. The recommender systems (Wikipedia had the most to-the-point summary for me) decide what content to give us, as a function of our profile data and our past interactions. The interactions can include many things depending on the interface (mobile vs. desktop) and platform (app, operating system, webpage). Netflix, often touted as making the most advancements to the technology, has a webpage describing some of their research in the area.
What kind of machine learning do these systems use? A survey from 2015 found that many applications used decision trees and Bayesian methods (because they give you interpretability). Only two years later there was a survey exclusively on deep learning approaches. I think many companies are using neural networks and are okay with the tradeoff of dramatically improved performance at the cost of interpretability. It's not like they would tell the customer why they're seeing certain content, would they?
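To make that concrete, here is a minimal sketch (in numpy, with random stand-in vectors) of how an embedding-style recommender might score items for one user. None of this reflects any particular company's system; the dimensions and names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Learned representations (random stand-ins here): one vector per user and per item.
n_items, dim = 5, 8
user_embedding = rng.normal(size=dim)            # built from profile data + past interactions
item_embeddings = rng.normal(size=(n_items, dim))

# Score every item for this user and recommend the highest-scoring ones.
scores = item_embeddings @ user_embedding        # higher score = predicted engagement
ranking = np.argsort(-scores)
print("recommended item order:", ranking.tolist())
```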
I drew up a fun diagram that is your baseline, and it'll be expanded throughout the article. It's a sort of incomplete reinforcement learning framework where the agent takes action a -> the environment moves to state s -> and returns reward r.
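Written as code, the loop in the diagram looks roughly like the sketch below. The policy and environment here are toy placeholders I made up, not a real platform.

```python
import random

def agent_policy(state):
    # Placeholder policy: pick one of two actions at random.
    return random.choice(["show_article_A", "show_article_B"])

def environment_step(state, action):
    # Placeholder dynamics: next state and a noisy reward for the chosen action.
    next_state = state + 1
    reward = 1.0 if action == "show_article_A" and random.random() < 0.6 else 0.0
    return next_state, reward

state = 0
for t in range(5):
    action = agent_policy(state)                       # agent takes action a
    state, reward = environment_step(state, action)    # environment moves to state s
    print(t, action, reward)                           # and returns reward r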
Clickthrough led to clickbait
Clickthrough (a heuristic for engagement, i.e. reward) was an early metric used to create recommendations. The clickbait problem (open a link, close it immediately) led to dwell-time metrics on pages. I have been sufficiently burnt out of clickbait sites, so I started this direct-to-reader blog. That's just the surface-level effect for me, and I am sure there's more. For a while, Facebook's go-to metric was the usage of the click-to-share button. Okay, you get the point; on to the paper.
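As a toy illustration with invented numbers (not any platform's actual formula), a dwell-time threshold discounts the open-and-immediately-close behavior that raw clickthrough rewards:

```python
# Each logged event: (clicked, seconds spent on the page). Numbers are invented.
events = [(1, 2), (1, 95), (0, 0), (1, 3), (1, 240), (0, 0)]

impressions = len(events)
clicks = sum(c for c, _ in events)
meaningful_reads = sum(1 for c, dwell in events if c and dwell >= 30)  # arbitrary 30 s threshold

print("clickthrough rate:", clicks / impressions)              # counts the 2-second "clickbait" opens
print("dwell-adjusted rate:", meaningful_reads / impressions)  # ignores them
```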
I found a paper from a workshop on Participatory Approaches to Machine Learning at the International Conference on Machine Learning last week (when you look closer, there are many great papers to draw on; I will likely revisit them). When you see block quotes, they are from What are you optimizing for? Aligning Recommender Systems with Human Values, Stray et al., 2020. There's some great context on how the systems are used and deployed.
Most large production systems today (including Flipboard (Cora, 2017) and Facebook (Peysakhovich & Hendrix, 2016)) instead train classifiers on human-labelled clickbait and use them to measure and decrease the prevalence of clickbait within the system.
Human-labeled content is a bottleneck when generated content outpaces labeling capacity. Also, a classifier trained on those labels will be outdated immediately at deployment (the test set is constantly moving). There are also companies that won't describe how they operate. Another point regarding industry usage I found interesting is:
Spotify is notable for elaborating on the fairness and diversity issues faced by a music platform. Their recommendation system is essentially a two-sided market, since artists must be matched with users in a way that satisfies both, or neither will stay on the platform.
And, obvious comment below, but requisite.
Especially when filtering social media and news content, recommenders are key mediators in public discussion. Twitter has attempted to create “healthy conversation” metrics with the goal to “provide positive incentives for encouraging more constructive conversation” (Wagner, 2019).
My impression of the learned models is: if the big companies do it, it's because it works. Again, don't assume malintent, assume profits. Now that we have covered how the companies are using their platforms to addict us to their advertisements, here is a small update to our model: a feedback loop and bidirectional arrows.
Ethics of Recommender Systems
Our computers are deciding what to put in front of us, primarily so that the companies retain us as reliable customers. What could go wrong? What are you okay with robots recommending for you? Your social media content — okay. How I decide my career path — I don’t know.
I don’t blame companies for making these tools and putting them in front of us — they want to make money after all. These issues will come to the forefront as the negative effects compound over the next few years. Here are a few points where I don’t think companies are held to high enough standards:
- Financial Technology (Fintech) Companies: manipulate your brain into engaging with financial products in different ways, which has had more dramatic effects on people who lack financial stability.
- High-traffic Media Platforms: Beyond the simple point of the hours you spend online each day, or how Google dictates everything you see, technology companies have tried to "be the internet" in developing nations. Click the link to see what happened when Facebook tried to be the internet in India (they were nice enough to include Wikipedia, though!).
- News Sources: Mainstream newsrooms (and definitely fringe sites, and everything in between) use automated methods to tune what news is given to you. I see a future where they tune the writing style to better match your views, too. Conformism is not progressive.
I want to start with what has been called the Value Alignment Problem in at-scale, human-facing AI (an example paper on legal contracts, AI, and value alignment: Hadfield-Menell & Hadfield, 2019).
Low-level ethics of working with human subjects
I define the ethical problem here as the short-term results (highlighted below) and the long-term mental rewiring of humans whose lives are run by algorithms.
Concerns about recommender systems include the promotion of disinformation (Stocker, 2019), discriminatory or otherwise unfair results (Barocas et al., 2019), addictive behavior (Hasan et al., 2018)(Andreassen, 2015), insufficient diversity of content (Castells et al., 2015), and the balkanization of public discourse and resulting political polarization (Benkler et al., 2018).
Stray et al., 2020 continue and introduce the Recommender Alignment Problem. It is a specific version of the value alignment problem that could emerge more forcefully because of how prevalent these technologies are in our lives. If at this point you aren't thinking about how they affect you, have you been reading closely? Finally, a three-phase approach to alignment:
We observe a common three phase approach to alignment: 1) relevant categories of content (e.g., clickbait) are identified; 2) these categories are operationalized as evolving labeled datasets; and 3) models trained off this data are used to adjust the system recommendations
This can be summarized as identification (of content and issues), operationalization (of models and data), and adjustment of the deployment. This sounds relatively close to how machine learning models are deployed to begin with, but it is detailed below.
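Here is a heavily simplified sketch of that three-phase loop, using a stand-in classifier and invented headlines: identify a category (clickbait), operationalize it as a labeled dataset and model, then adjust recommendations by down-weighting content the model flags. Real systems are far larger and their features are not public.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Phase 2: an (evolving) labeled dataset -- invented headlines, 1 = clickbait.
headlines = [
    "You won't BELIEVE what happened next",
    "Ten tricks doctors don't want you to know",
    "Quarterly inflation report released by central bank",
    "City council approves new transit budget",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
clf = LogisticRegression().fit(vectorizer.fit_transform(headlines), labels)

# Phase 3: adjust recommendations by down-weighting predicted clickbait.
candidates = ["This one weird trick will shock you", "New transit budget explained"]
base_scores = np.array([0.9, 0.7])                       # scores from the recommender itself
p_clickbait = clf.predict_proba(vectorizer.transform(candidates))[:, 1]
adjusted = base_scores * (1.0 - p_clickbait)             # simple multiplicative penalty

for title, score in sorted(zip(candidates, adjusted), key=lambda x: -x[1]):
    print(round(float(score), 2), title)
```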
High-level recommender system adjustments
The high-level ideas again are from the paper, but the comments are my own.
- Useful definitions and measures of alignment: companies need to create research on internet metrics that better match user expectations and accumulated harm (or uplift!).
- Participatory recommenders: having humans in the loop for content will enable much better matching of human reward to modeled reward, which will pay off in the long term.
- Interactive value learning: this is the most important issue, and it can encompass all the others. Ultimately, assume the reward function is a distribution, and extreme exploitation dramatically decreases (more below).
- Design around informed, deliberative judgment: this seems obvious to me, but please, no fake news.
Let’s continue with point three.
Modeling human reward
The mismatch between what the optimization problem is defined as and what the optimization problem really is will be the long-term battle of applying machine learning systems in safe interactions with humans.
The model used by most machine learning tools now is to optimize a reward function given to the computer by a human. The Standard Model (Russell, Human Compatible, 2019) is nothing more than an optimization problem where the outcomes improve when the metric on a certain reward function improves. This falls flat on its face when we consider weighing the rewards of multiple humans against each other (magnitude and direction), the fact that AIs will exploit unmentioned avenues for action (I tell the robot I want coffee, but the nearest coffee shop charges $12; that's not an outcome I wanted, but the robot "did it"), and more deleterious unmodeled effects.
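A toy numerical example of that gap (all numbers invented): optimizing a proxy reward such as predicted clicks happily selects the item that a fuller notion of reward would penalize.

```python
# Invented catalogue: (name, predicted click probability, "true" long-term satisfaction)
items = [
    ("outrage bait",   0.9, -0.5),
    ("in-depth piece", 0.4,  0.8),
    ("cat video",      0.6,  0.3),
]

best_by_proxy = max(items, key=lambda it: it[1])   # Standard Model: optimize the given metric
best_by_true  = max(items, key=lambda it: it[2])   # what we actually wanted optimized

print("proxy optimum:", best_by_proxy[0])   # outrage bait
print("true optimum: ", best_by_true[0])    # in-depth piece
```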
What is a better way to do this? The better way is, again, interactive value learning. Value learning is a framework that would allow the AIs we make to never assume they have a full model of what humans want. If an AI only thinks it has an 80% chance of acting correctly, it will be much more timid in its actions to maintain high expected utility (I think of that 20% chance as including some very negative outcomes). Recommender systems need to account for this as well; otherwise, we will be spiraling in a game that we have little control over.
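A minimal sketch of that idea, with made-up numbers: keep a small distribution over candidate reward functions rather than a single one, and choose the action with the best expected utility. An action that is catastrophic under even one plausible hypothesis gets pulled down.

```python
import numpy as np

actions = ["push notification", "show article", "do nothing"]

# Candidate reward functions the system thinks the user *might* have, with probabilities.
# Rows: hypotheses about the user; columns: reward of each action under that hypothesis.
reward_hypotheses = np.array([
    [ 1.0, 0.5, 0.0],   # user who welcomes interruptions (probability 0.8)
    [-5.0, 0.3, 0.2],   # user who hates interruptions    (probability 0.2)
])
probs = np.array([0.8, 0.2])

expected_utility = probs @ reward_hypotheses        # average over what the user might want
best = actions[int(np.argmax(expected_utility))]
print(dict(zip(actions, expected_utility.round(2))), "->", best)
# The interruption loses despite being best under the most likely hypothesis.
```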
Reinforcement Learning with Humans and Computers in the Loop
Reinforcement learning is an iterative framework where an agent interacts with an environment via actions to maximize reward. Reinforcement learning (RL) has had a lot of success with confined games. In this case, there are two ‘game’ framings.
- The application is the agent, and the human is part of the state space (this actually fits the problem formulation better).
- The human is the agent, the computer and the world are the environment, and the reward is hard to model. This one is much more compelling, so on I go. This is the game I refer to in my title.
Ultimately, the FAANG companies are going to be logging all of the traffic data (including the heuristics toward true human reward that we talked about earlier) and trying to learn how your device should interact with you. It's a complicated system that has downstream effects on everyone else you interact with inside the feedback loop. As an RL researcher, I know the algorithms are fragile, and I do not want that applied to me (but I struggle to remove myself, frequently). The diagram above is most of the point: there is no way that a single entity can design an optimization to "solve" that net.
Let's talk about the data and modeling. To my knowledge, FAANG is not using RL yet, but they are acquiring a large dataset to potentially do so. The process of going from a large dataset of states, actions, and rewards to a new policy is called batch reinforcement learning (or offline RL). It tries to distill a history of unordered behavior into an optimal policy. My view of the technology companies' applications is that they are already playing this game, but an RL agent doesn't determine updates to the recommender system; a team of engineers does. The only case that could be made is that maybe TikTok's black box has shifted towards an RL algorithm prioritizing viewership. If recommendation systems are going to become a reinforcement learning problem, the ethical solutions need to come ASAP.
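For readers who want to see the shape of the problem, here is a heavily simplified batch RL sketch on an invented logged dataset: tabular Q-iteration over a fixed batch of transitions. Real offline RL for recommendations would use function approximation and worry much more about distribution shift.

```python
import numpy as np

# Logged transitions (state, action, reward, next_state) -- invented toy data.
# States and actions are small integers so we can use a tabular Q function.
n_states, n_actions, gamma = 3, 2, 0.9
batch = [
    (0, 0, 0.0, 1), (0, 1, 1.0, 2), (1, 0, 0.0, 0),
    (1, 1, 0.5, 2), (2, 0, 1.0, 2), (2, 1, 0.0, 0),
]

Q = np.zeros((n_states, n_actions))
for _ in range(100):                       # repeatedly push Q toward the Bellman targets
    new_Q = Q.copy()
    for s, a, r, s_next in batch:
        new_Q[s, a] = r + gamma * Q[s_next].max()
    Q = new_Q

policy = Q.argmax(axis=1)                  # greedy policy distilled from the fixed batch
print("Q values:\n", Q.round(2))
print("policy per state:", policy.tolist())
```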
Here are resources for readers interested in batch RL course material, offline RL research, and broad challenges of real-world RL.
Like this? Please subscribe to my direct newsletter on robotics, automation, and AI at democraticrobots.com.
Original article: https://towardsdatascience.com/recommender-systems-value-alignment-reinforcement-learning-and-ethics-625eefaaf138