Various Approaches Towards Data Privacy with Python: Differential Privacy and Homomorphic Encryption
Matt Canute, Young-Min Kim, Donggu Lee, Suraj Swaroop, Adriena Wong
This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit here.
What is Data Privacy? Why is this important for data scientists or data engineers?
First, it’s important to note the difference between data privacy and data security, because these terms are often used synonymously. While data security refers to the overall process of protecting data from unauthorized access, data privacy is a branch of data security that deals with the collection, distribution, and use of data. Consider a case where a company encrypts and protects user data from external attackers. It has preserved data security, but may still be violating data privacy if the data was collected without proper consent. On the other hand, if a company preserves data privacy but does not secure user data reliably, it will likely have violated both data privacy and data security once hackers attack. You can have data protection without data privacy, but not data privacy without data protection.
Regarding data privacy, data scientists and data engineers might often have a role in designing the data collection process of various applications, so that the proper information can be stored for effective decision analysis and model-building.
However, one trend to be wary of is that the design of today’s data-intensive applications seems to cause a chilling effect in the ways we interact with them. Users of a tool may not feel free to search or click whatever they want, knowing that their information has been logged forever somewhere in a database. For instance, user behaviour may have been affected by Facebook’s collection of user data from Android devices and/or the Snowden leaks. This is especially the case now that data security systems appear more fragile, given frequent news reports of hacked databases at large companies, such as the recent LifeLabs hack, so it can feel inevitable that our search histories won’t stay private much longer.
Due to these concerns, we’re seeing policy implemented at various levels to address them, as many countries attempt to codify their interpretations of data privacy, data rights, and data security, such as the General Data Protection Regulation (GDPR) in the European Union (EU). Here are some other examples of data regulation laws:
Australia: The Notifiable Data Breaches (NDB) scheme
Brazil: Lei Geral de Proteção de Dados (LGPD)
Canada: The Personal Information Protection and Electronic Documents Act (PIPEDA)
So, we have two issues:
- A user’s actions collected by these tools are probably becoming less useful, since the user may be wary of the fact that their actions are being logged on a server somewhere that could eventually be hacked and leaked, and thus the logged behaviour may show a skewed, misleading overall pattern.
- The increasing number of regulations mentioned above makes it difficult to start building an application without knowing all the protocols to follow. For example, under GDPR, a citizen has the right to be forgotten from a database or a model, but this becomes tricky when the model is a black-box stack of neural networks, and it’s difficult to know whether their personal data has accidentally been ‘memorized’ by the model for making recommendations.
It turns out that these can both be addressed to an extent using the technique of differential privacy.
Differential Privacy
Differential privacy is a method for sharing information about a total population from a statistical database or dataset, while withholding private information by adding noise based on certain mechanisms. The reason we call it differential privacy is that it allows you to mask your dataset against a ‘differential attack,’ which involves combining data from multiple sources to derive information that wasn’t included in the original dataset.
Addressing issue 1: Differential privacy’s approach of adding noise to improve data collection has been in use since the 1960s to eliminate evasive answer bias, which arises when users are surveyed on difficult questions that give them an incentive to lie under ordinary circumstances. For example, consider conducting a survey where you’re trying to determine the true proportion of data scientists who have felt pressure to “massage the data” in order to obtain a certain conclusion. Instead of asking them that question outright, you’d ask them to flip a coin twice in private. If the first flip lands heads, they answer truthfully. If it lands tails, they use the second flip to determine what to say: heads for yes, tails for no. As a result, people are more relaxed, and hence more truthful in their answers, because they can rest assured that if the survey data is ever leaked, they have plausible deniability: perhaps they answered ‘yes’ purely because they flipped a tails and then a heads. With a large enough sample, the results can then be aggregated to uncover the true proportion of people who have “massaged the data”.
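To make this concrete, here is a minimal simulation of the two-coin protocol in Python; the survey size and the ‘true’ proportion below are made-up values for illustration:

```python
import random

def randomized_response(truth: bool) -> bool:
    """One answer under the two-coin protocol described above."""
    if random.random() < 0.5:       # first flip lands heads: answer truthfully
        return truth
    return random.random() < 0.5    # tails: the second flip decides the answer

# Simulate a survey; true_rate is a made-up 'real' proportion of yes-answers.
n, true_rate = 100_000, 0.30
answers = [randomized_response(random.random() < true_rate) for _ in range(n)]
observed = sum(answers) / n

# A respondent says yes with probability 0.25 + true_rate / 2,
# so inverting that recovers an estimate of the true proportion.
estimate = 2 * (observed - 0.25)
print(f"observed yes-rate: {observed:.3f}, estimated true rate: {estimate:.3f}")
```

The estimator works because each respondent answers ‘yes’ with probability 0.25 + p/2, where p is the true proportion, so solving for p recovers it from the observed rate.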
Addressing issue 2: Incorporating differential privacy mechanisms into machine learning algorithms can help prevent accidentally incorporating personal information into the model. For example, although you could remove the social insurance number and credit card number columns from the training set of a model you’re building, credit card numbers could still leak into the model unexpectedly through a text feature’s bag-of-words representation. Using differential privacy reduces the chance that a model will overfit to individual entries in a dataset. For example, in 2018 a team improved a model that was previously leaking social security numbers, and showed that under the differentially private model no such leakage occurred, while accuracy remained nearly as good [1].
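One way to experiment with this idea in Python is IBM’s open-source diffprivlib library, which offers differentially private versions of common scikit-learn models. The sketch below is our own illustration (not the cited team’s method), and assumes diffprivlib and scikit-learn are installed:

```python
# A sketch of training a differentially private classifier with diffprivlib.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from diffprivlib.models import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# epsilon is the privacy budget; data_norm bounds each row's L2 norm,
# which the mechanism needs in order to calibrate its noise.
clf = LogisticRegression(epsilon=1.0, data_norm=10.0)
clf.fit(X_train, y_train)
print("accuracy under differential privacy:", clf.score(X_test, y_test))
```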
Experimenting with Differential Privacy applications in Python:
Here’s how this works.
The key to achieving differential privacy is to add random noise to the ground truth. There is always a fundamental trade-off between privacy and accuracy in a differential privacy algorithm. For example, a dataset consisting purely of random noise will obviously maintain complete privacy, but it won’t be useful. Epsilon (ε) is the metric of privacy loss under a differential change, such as adding or removing a datapoint. A smaller ε gives stronger privacy protection but lower accuracy, where accuracy is the degree of similarity between the output of the differentially private algorithm and the pure output.
There are two primary noise mechanisms in differential privacy, namely the Laplace mechanism and the exponential mechanism:
Laplace Mechanism:
The Laplace mechanism is used when the output is numerical. It computes a function f that maps a database query to k real numbers, then adds noise drawn from the Laplace distribution Lap(x|b) to each coordinate, where b is a scale parameter; this distribution is a symmetric version of the exponential distribution. More specifically, the noise variables are scaled to the sensitivity of f divided by ε, denoted ∆f/ε, where the sensitivity ∆f is the maximum amount f can change when a single record is added or removed. The mechanism, denoted M_L, is defined by

M_L(x, f, ε) = f(x) + (Y_1, …, Y_k),

where the Y_i are independent, identically distributed random variables drawn from Lap(∆f/ε) [2].
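As a sketch of how this looks in code, here is the Laplace mechanism applied to a hypothetical counting query (sensitivity ∆f = 1, since one person changes a count by at most 1); note how smaller ε, i.e. stronger privacy, produces noisier answers, the trade-off described above:

```python
import numpy as np

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    """Perturb a numeric query answer with noise drawn from Lap(sensitivity/epsilon)."""
    return true_answer + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical counting query: adding or removing one record changes
# the count by at most 1, so the sensitivity is 1.
true_count = 1234
for eps in (0.1, 1.0, 10.0):
    samples = [laplace_mechanism(true_count, 1.0, eps) for _ in range(5)]
    print(f"ε = {eps:>4}: " + ", ".join(f"{s:.1f}" for s in samples))
```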
Exponential Mechanism:
The exponential mechanism is another way to achieve differential privacy when the output is categorical. The mechanism selects a response from an abstract non-numeric range R, biased toward responses with high utility scores. It is based on a utility function u() that maps database–output pairs (x, r), with r ∈ R, to a utility score; more specifically, u(x, r) represents how good the output r is for database x. The exponential mechanism outputs each r ∈ R with probability proportional to exp(εu(x, r)/(2∆u)), where ∆u is the sensitivity of u with respect to x, defined by

∆u = max_{r ∈ R} max_{x, y : ‖x − y‖₁ ≤ 1} |u(x, r) − u(y, r)|,

where x, y are databases differing in at most one record [2].
Thus, the privacy loss [2] incurred by observing an output ξ can be represented by:

L^(ξ)(M(x) ‖ M(y)) = ln( Pr[M(x) = ξ] / Pr[M(y) = ξ] ).
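A minimal sketch of the exponential mechanism in Python follows, using the exp(εu/(2∆u)) weighting from [2]; the name-frequency query is a made-up example:

```python
import numpy as np

def exponential_mechanism(candidates, utilities, sensitivity, epsilon):
    """Sample one candidate with probability proportional to exp(eps * u / (2 * du))."""
    scores = epsilon * np.asarray(utilities, dtype=float) / (2 * sensitivity)
    scores -= scores.max()            # stabilise before exponentiating
    probs = np.exp(scores)
    probs /= probs.sum()
    return np.random.choice(candidates, p=probs)

# Made-up categorical query: privately release the most common first name.
# Utility = each name's count; one person changes a count by at most 1, so du = 1.
names  = ["Alice", "Bob", "Carol", "Dan"]
counts = [40, 30, 20, 10]
for eps in (0.01, 1.0):
    picks = [exponential_mechanism(names, counts, 1.0, eps) for _ in range(1000)]
    print(f"ε = {eps}: 'Alice' chosen {picks.count('Alice')} of 1000 times")
```

With tiny ε the choice is close to uniform (strong privacy, low utility); with larger ε the highest-utility name dominates.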
Some examples of applications you probably use today that use differential privacy:
From iOS 10 onwards, Apple uses a differential privacy implementation to help protect the privacy of users’ activity, while gaining insights that improve the usability of features such as emoji suggestions and lookup hints.
Similar to Apple’s technique, Google uses a differential privacy tool called Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) with Google Chrome to prevent sensitive information like a user’s browsing history from being traced. As Google Chrome is one of the most widely used browsers, it is susceptible to attacks from malicious software. Google collects usage statistics from users who agreed to send their data in and uses RAPPOR to draw overall insights from the data while maintaining each individual user’s privacy.
A Current Development in Data Security: Homomorphic Encryption
Today’s encryption algorithms are very reliable because of the amount of processing power required to crack them, making attacks both highly costly and time-consuming. One popular encryption method is the RSA (Rivest-Shamir-Adleman) cryptosystem, which uses a public encryption key and a private decryption key, and relies on the difficulty of factoring the product of two very large prime numbers. Unfortunately, in order to perform any data processing or analysis, it is usually necessary to decrypt the data first, which may jeopardize private information such as sensitive financial or medical records. One solution is fully homomorphic encryption, an encryption method that uses an algebraic procedure to allow encrypted data to be manipulated and processed without first decrypting it. The key advantage is that after performing calculations on the encrypted data, we can still effectively query the encrypted databases for analysis, and when the data is decrypted, the same analyses performed on the decrypted data yield the same results. This is extremely useful since it allows a third party to perform computations directly on encrypted data without the need for decryption, thus keeping the contents of the data private.
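As a toy illustration of the homomorphic property, textbook RSA (mentioned above) is multiplicatively homomorphic: the product of two ciphertexts decrypts to the product of the two plaintexts. A minimal sketch with insecurely small parameters, for illustration only:

```python
# Toy demonstration of RSA's multiplicative homomorphism.
# WARNING: unpadded RSA with tiny primes is completely insecure.
p, q = 61, 53
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)
e = 17                       # public exponent, coprime with phi
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)

def encrypt(m: int) -> int:
    return pow(m, e, n)

def decrypt(c: int) -> int:
    return pow(c, d, n)

m1, m2 = 7, 6
c1, c2 = encrypt(m1), encrypt(m2)
# Multiplying ciphertexts corresponds to multiplying plaintexts:
assert decrypt((c1 * c2) % n) == (m1 * m2) % n
print("E(7) * E(6) decrypts to", decrypt((c1 * c2) % n))
```

Fully homomorphic schemes extend this idea so that both addition and multiplication, and hence arbitrary circuits, can be evaluated on ciphertexts.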
Unfortunately, fully homomorphic encryption is still under development, as it is currently impractically slow due to its large computational overhead. It also supports only addition and multiplication; other operations, such as sorting or regular expressions, would be too complex for its current implementations.
Although the first fully homomorphic encryption scheme was created in 2009 by IBM researcher Craig Gentry, it wasn’t until 2016 that IBM released the first version of the HElib library, an open-source homomorphic encryption library in C++, which unfortunately performed significantly slower than plaintext operations.
In 2017, Microsoft Research released Microsoft SEAL (Simple Encrypted Arithmetic Library) in C++, which offers encrypted data storage and computation services using homomorphic encryption. In an effort to make this library more practical for data scientists and other engineers, Lab41 and B.Next created a port of the Microsoft SEAL library to Python, called PySEAL, which can easily be imported into Python projects. This link shows an example of how PySEAL works.
In 2019, Google also rolled out an open-source Private Join and Compute functionality that makes use of homomorphic encryption to allow two parties to privately perform certain aggregate calculations on two joined datasets without knowing the contents of each other’s inputs.
Conclusion
As data privacy becomes an increasingly prominent topic in the news and within the field of data science, companies will need to consider safeguards for keeping their clients’ personal data safe in case of a data breach. We have introduced two methods of protecting private data while still maintaining the ability to draw analyses from it: differential privacy and homomorphic encryption.
[1] Rachel Cummings, Deven Desai. The Role of Differential Privacy in GDPR Compliance. https://piret.gitlab.io/fatrec2018/program/fatrec2018-cummings.pdf, 2018.
[2] Cynthia Dwork, Aaron Roth. The Algorithmic Foundations of Differential Privacy. Now Publishers Inc, 2014.
Translated from: https://medium.com/sfu-cspmp/various-approaches-towards-data-privacy-with-python-differential-privacy-and-homomorphic-a748e560d43b