Data Ethics in Artificial Intelligence & Machine Learning

Ethics is an important aspect of life, and unethical behavior of any kind is simply harmful and scary. The same principle holds in the technical world. With the evolution of big data and high-performance computing, artificial intelligence (AI) has been making progress by leaps and bounds. We know that to build an efficient AI system we need well-curated data and an algorithm that performs well on unseen data. So, DATA is the main fuel for an AI or ML (Machine Learning) algorithm, and the data collected for these purposes can carry biases and unethical elements that confuse an algorithm or deviate its behavior, producing a wholly unethical system that could be dangerous for society.

Targeted advertising, societal bias, and fake news are some relevant examples, but there are several other instances from the past where an ML algorithm was misused (sometimes unintentionally, because not all possible behaviors of the model were tested) and proved to behave in undesired ways. A few such cases are below:

  1. UK’s Grading algorithm — Recently, the UK Department for Education discarded grades generated by an algorithm designed to predict performance in the annual A (Advanced) Level qualification. The initiative was taken due to the COVID pandemic, and more than a third of A-Level results in the UK were downgraded by the algorithm’s predictions. The model focused primarily on two features, ‘student’s past performance’ and ‘school’s historical performance’, to predict students’ grades. The algorithm’s predictions went in favor of private schools, while secondary selective and sixth-form schools, where teacher assessments had been good, were severely impacted.

  2. Unethical facial recognition — A Washington Post article reported how the US’s Immigration and Customs Enforcement unethically collected a large volume of data to analyze the day-to-day activities of immigrant communities. This is an example of the unethical use of AI to abuse the civil rights of targeted communities.

  3. Amazon’s AI recruiting tool — The tool Amazon developed for hiring turned out to be biased against female job applicants. See the full story here.

  4. Unemployment benefit fraud — MiDAS (Michigan Integrated Data Automation System), an unemployment system launched to replace its old COBOL-based legacy system, booked many people for fraud when they claimed unemployment benefits. It wrongly accused at least 20,000 claimants of fraud, a shockingly high error rate of 93 percent. The problem was the alleged “Robo-Adjudication” system, which lacked human oversight. The application seeks out discrepancies in claimants’ files, and if it finds one, the individual automatically receives a financial penalty and is then flagged for fraud. Have a look at this Metro Times post for more details.

  5. Microsoft’s unveiled Tay — A Twitter bot launched with the idea that “the more you chat with Tay, the smarter it gets” got corrupted within 24 hours of its launch by a supply of misogynistic, racist messages from Twitter. Check this post.

  6. Google’s hate speech detector — Google’s AI tool developed to catch hate speech turned out to be biased against black people.

So, those malfunctioning AI/ML tools, which were certainly developed by top developers and envisioned by great business leaders, suddenly became threats to society. And then the real question appears: how can one create an ethical way of working and sensible responsibilities among all groups of collaborators (data collectors, developers, decision makers, sales, marketing, executives, etc.)? Several papers have been published in this direction, and there are no golden rules to be followed religiously, but a few important aspects of the problem can be summarized. I would like to highlight them as the purpose of this article.

1) The 5 Cs

Data Ethics — 5 Cs

a) Consent — An explicit agreement between the data provider and the data service on what data is collected and for what use.

b) Clarity — Clarity is directly related to consent: data providers must be told, in plain terms, what they are providing.

c) Consistency & Trust — An unpredictable person cannot be trusted, and hence trust requires consistency. These facts are important and should be part of data ethics, as we have seen many security incidents in which consistency and trust were broken, explicitly or implicitly. The breaches involving Yahoo, Target, Anthem, local hospitals, and government data are a few examples.

d) Control & Transparency — Once consent has been provided, it becomes important to understand how the data is being used. Do users have any control over it? These questions matter because we know how big companies generally use public data for their own targeted advertising and for creating political and religious sentiment. To address these issues to a certain extent, Europe’s General Data Protection Regulation (GDPR) is a good example: it enables users to withdraw consent and have their data removed from the system to which it was earlier submitted.

e) Consequences — Risk can never be eliminated completely. Products using AI and ML get built, and sometimes, due to potential issues around the use of the data, unforeseen consequences arrive. Many regulations and guidelines have been formed to tackle these kinds of problems, e.g., the Children’s Online Privacy Protection Act (COPPA), which protects children and their data, and the Genetic Information Nondiscrimination Act (GINA), created in response to rising fears that genetic testing could be used against a person or their family.

Implementing the 5 Cs is not an individual responsibility; it requires the entire team, working with the idea of shared responsibility.

2) Biased Data or Biased Algorithm — Among data practitioners, the root cause of bias in real-world AI is often an arguable topic: is it the data or the algorithm? There are of course different views, but in most cases it is humans who develop these mathematical models and feed them with datasets that are often created or collected by humans as well. So ultimately, bias is somewhere related to humans, and we need to show responsibility and best practices while collecting the data or designing the next sci-fi AI model.

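As a minimal sketch of what such responsibility can look like in practice (assuming a pandas DataFrame with hypothetical `gender` and `hired` columns, invented for illustration), two small checks on representation and label balance can surface obvious bias before any model is trained:

```python
import pandas as pd

# Hypothetical hiring dataset; column names and values are invented
# for illustration, not taken from any real system.
df = pd.DataFrame({
    "gender": ["F", "M", "M", "M", "F", "M", "M", "M"],
    "hired":  [0,   1,   1,   0,   0,   1,   0,   1],
})

# 1) Is each group represented in proportion to the population served?
print(df["gender"].value_counts(normalize=True))

# 2) Does the outcome label already correlate with a protected attribute?
#    A large gap warns that a model trained on this data may learn the bias.
print(df.groupby("gender")["hired"].mean())
```

Checks like these do not prove fairness, but a skewed representation or a large gap in label rates is a signal to revisit the data collection before blaming the algorithm.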

3) Context — Contextual awareness plays a significant role for anyone working in AI and ML. Understanding the data well enough to answer standard questions, such as what I am trying to achieve and why, helps in designing an algorithm that makes decisions with the context of the data in mind.

4) Model Fairness & Explainability — Can a model’s results be trusted? Can we explain why particular features and their weights are important for the predicted values? These questions are relevant for deciding whether developed models are fair to use and justifiable enough for the purpose for which they were developed.

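One concrete way to probe such questions before release is to compare error rates across user groups. The sketch below is a minimal example; the function name, labels, predictions, and group tags are all hypothetical:

```python
import numpy as np

def error_rate_by_group(y_true, y_pred, groups):
    """Misclassification rate per user group, making disparate
    error rates visible before deployment."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {
        g: float(np.mean(y_true[groups == g] != y_pred[groups == g]))
        for g in np.unique(groups)
    }

# Hypothetical labels, predictions, and a group tag per sample.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(error_rate_by_group(y_true, y_pred, groups))
# {'A': 0.25, 'B': 0.5} -> group B sees twice the error rate of group A
```

A result like this does not explain the model, but it flags exactly the kind of disparity that the checklist at the end of this article asks about.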

5) Model Drift — Analytical models need to be revised over time, otherwise there is a high chance of instability and erroneous predictions. In ML/AI, this behavior is called model drift. It is classified into two broad categories.

i) Concept Drift — The statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways.

ii) Data Drift — This happens when the statistical properties of the predictors (the independent variables) change. These changes can cause the model to fail. The classic example of data drift is seasonality: the Black Friday period always records better sales than other times of the year. A small detection sketch follows this list.

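As a rough illustration of monitoring for data drift (an assumption-laden sketch, not a method from the original article), a two-sample Kolmogorov-Smirnov test from SciPy can flag when a feature's live distribution no longer matches the training distribution; the threshold and the simulated seasonal shift are invented for the example:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_data_drift(train_values, live_values, alpha=0.05):
    """Flag drift when a two-sample KS test rejects the hypothesis
    that training and live data come from the same distribution."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha, statistic, p_value

# Toy example: simulate a seasonal shift, e.g. Black Friday order values.
rng = np.random.default_rng(42)
train = rng.normal(loc=100, scale=15, size=5000)  # historical orders
live = rng.normal(loc=130, scale=20, size=1000)   # holiday-season orders

drifted, stat, p = detect_data_drift(train, live)
print(f"drift detected: {drifted} (KS statistic={stat:.3f}, p={p:.4g})")
```

In production, such a test would run on a schedule against fresh data, and a detected drift would trigger retraining or review rather than an automatic penalty.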

6) Ethics and Security training — Theory learned or taught as part of an educational curriculum lacks practical implementation; that is why training in ethics and security is important for professionals, because it enables them to implement these principles in their field.

So, if we collect those points as a checklist and follow them while making any decision, that could be helpful for avoiding common mistakes. These points can also enable us to become more responsible and sensitive towards our work. Mike Loukides, DJ Patil, and Hilary Mason have compiled the checklist below in their book Ethics and Data Science, and it is worth having in our data product checklist.

Checklist —

Have we listed how this technology can be attacked or abused?

Have we tested our training data to ensure it is fair and representative?

Have we studied and understood possible sources of bias in our data?

Does our team reflect the diversity of opinions, backgrounds, and kinds of thought?

What kind of user consent do we need to collect to use the data?

Do we have a mechanism for gathering consent from users?

Have we explained clearly what users are consenting to?

Do we have a mechanism for redress if people are harmed by the results?

Can we shut down this software in production if it is behaving badly?

Have we tested for fairness with respect to different user groups?

Have we tested for disparate error rates among different user groups?

Do we test and monitor for model drift to ensure our software remains fair over time?

Do we have a plan to protect and secure user data?

So, in short, data ethics principles can help us leverage the full benefit of AI for the good of society without any fear, and can also create a sense of responsibility among all participants who develop data products to solve critical problems.

References —

https://hub.packtpub.com/machine-learning-ethics-what-you-need-to-know-and-what-you-can-do/

https://learning.oreilly.com/library/view/ethics-and-data/9781492043898/

https://www.bbc.com/news/explainers-53807730

Translated from: https://medium.com/swlh/data-ethics-in-artificial-intelligence-machine-learning-72467b9c70f3
