美团机器学习实践 密码_机器学习遇到密码学的地方

美团机器学习实践 密码

When reading this, chances are that you know one or another thing about machine learning already. You know that machine learning algorithms typically take in a bunch of samples, each containing a fixed amount of features, and output a prediction in the end.

阅读本文时,您很有可能已经对机器学习有所了解。 您知道机器学习算法通常会吸收一堆样本,每个样本都包含固定数量的特征,最后输出预测

What you maybe have heard about (but did not dig deeper into) is the field of cryptography. It is this mysterious subject where it’s all security, passwords, hiding things. Maybe you have even heard about AES or RSA, which are algorithms to encrypt data.

您可能听说过(但没有更深入地研究) 加密领域。 这是一个神秘的主题,它包含所有安全性,密码和隐藏的内容。 也许您甚至听说过AESRSA ,它们是加密数据的算法

But don’t worry, even if you have never dealt with cryptography before, you will be able to follow along since I will explain everything on an introductory level .

但是不用担心,即使您以前从未处理过加密技术,也可以继续学习,因为我将在入门级进行解释。

In this article, I want to bring both fields together. I will present to you an easy to understand, yet hard to solve problem used to build cryptographic algorithms — the so-called Learning Parity with Noise problem, LPN for short. The “L” in LPN should ring your machine learning alarm bells already because this problem can be seen as a routine machine learning problem!

在本文中,我想将这两个领域结合在一起。 我将向您介绍一个易于理解但难以解决的用于构建密码算法的问题,即所谓的“ 学习带有噪声的奇偶性”问题,简称LPN 。 LPN中的“ L”应该已经响起了您的机器学习警报,因为这个问题可以看作是例行的机器学习问题!

But first, let us see where the LPN problem naturally arises in a cryptographic setting and how to define it. We will solve the LPN problem by using machine learning afterward.

但是首先,让我们看看在密码设置中LPN问题自然产生的位置以及如何定义它。 之后,我们将通过机器学习来解决LPN问题。

动机 (Motivation)

Imagine that you own a hotel and you want to manage access to the guests’ rooms, i.e. each guest should only be able to enter their own room. Makes sense, right?

想象一下,您拥有一家酒店,并且想要管理对客人房间的访问,即每个客人只能进入自己的房间。 有道理吧?

Your hotel. Photo by Eiji K on Unsplash. 您的酒店。 Eiji K在 Unsplash上 拍摄 。

Now, traditionally you could use normal, physical keys. The disadvantage is that people sometimes lose their keys, which means a lot of costs for your business since you have to replace the lock from the affected doors.

现在,传统上您可以使用普通的物理键。 缺点是人们有时会丢失他们的钥匙,这意味着您必须为受影响的门更换锁,这对您的企业来说是很大的成本。

So you decide on deploying smart cards, in particular cards with RFID (radio-frequency identification) chips, and also the corresponding locks. Since you have to provide for a lot of doors and you want to save money, you choose very weak RFID chips, i.e. chips with diminishing computational power, maybe even without its own source of electricity.

因此,您决定部署智能卡,尤其是带有RFID(射频识别)的卡 芯片,以及相应的锁。 由于必须提供大量的门并且要节省资金,因此您选择了非常弱的RFID芯片 ,即计算能力降低的芯片,甚至可能没有自己的电源。

Susanne Plank on 苏珊·普朗克上 Pixabay. Pixabay 。

The way your system should work is the following: Every lock and every card has a secret key stored, a binary vector such as s=(1,0,1,0), just much longer in practice. If you hold your card next to a lock, the lock works as a reader, scanning the card’s secret key. The chip is called a tag in this context.

系统的工作方式如下:每个锁和每张卡都存储有一个秘密密钥 ,一个二进制向量,例如s = (1,0,1,0), 实际上要长得多 。 如果您将卡放在锁旁边,则该锁将用作读取器 ,扫描卡的密钥。 在这种情况下,该芯片称为标签

The clue: If the secret keys of the card and the door match, the door opens.

提示:如果卡的密码和门的密钥匹配,则门将打开。

Perfect! But how to do it? Well, an easy way is to hold your card next to the lock and the lock tells the chip on the card to send its secret key to the lock. Then the lock checks if both secret keys are equal and open the door, if yes.

完善! 但是怎么做呢? 好吧,一种简单的方法是将您的卡放在锁旁边,锁告诉卡上的芯片将其秘密密钥发送到锁。 然后,锁检查两个秘密钥匙是否相等,如果是,则打开门。

This makes sense, because if you do not have the correct card, i.e. the secret key on your chip is different from the secret key in the door lock, the door will not open.

这是有道理的,因为如果您没有正确的卡,即芯片上的密钥与门锁中的密钥不同,则门将无法打开。

问题 (The Problem)

The trouble with this solution begins when a guest wants to enter their room: A bad guy, usually called an attacker in cryptography, could sit in the hallway, apparently just typing innocently on their notebook. What the attacker actually does is sniffing the RFID traffic, i.e. reading the communication between the lock and the guest’s chip. If the chip sends the secret key directly, the attacker will see it, store it, forge a card containing this key and then will be able to enter the room.

这种解决方案的麻烦始于客人要进入房间时:一个通常被称为密码学攻击者的坏人可能坐在走廊上,显然只是在他们的笔记本上无辜地打字。 攻击者实际上所做的是嗅探 RFID流量,即读取锁和访客芯片之间的通信。 如果芯片直接发送密钥,攻击者将看到它,将其存储,伪造一张包含该密钥的卡,然后便可以进入房间。

A prototypical hacker at work, this time without a ski mask. Photo by Nahel Abdul Hadi on Unsplash. 一个典型的黑客正在工作,这次没有戴滑雪帽。 Nahel Abdul Hadi 摄于 Unsplash 。

So, this is a bad idea. It only works if there are no bad people in the world (highly unlikely). Instead, we have to arm ourselves and improve security for our guests. The idea is the following:

因此,这是一个坏主意。 它仅在世界上没有坏人的情况下才有效(极不可能)。 相反,我们必须武装自己并提高客人的安全性。 这个想法如下:

The chip somehow has to prove to the lock that it possesses the correct secret key without revealing it.

芯片必须以某种方式向锁证明它拥有正确的秘密密钥而不泄露它。

I hear you scream: That’s what encryption is for! And you are right. The attacker would only see garbage in the sniffing tool and wouldn’t be able to reconstruct the key. But sadly, the RFID chip is much too weak for encrypting anything because you wanted to save money, remember? Sadly, this is also true for bigger companies in the real world. The chip has nearly no computational power and also only barely enough storage for its secret key. So we need another, more light-weight solution.

我听到你在尖叫: 这就是加密的目的! 你是对的。 攻击者只会在嗅探工具中看到垃圾,而无法重构密钥。 但是可悲的是,RFID芯片太弱了,无法加密任何东西,因为您想省钱,还记得吗? 可悲的是,对于现实世界中的大型公司而言,情况也是如此。 该芯片几乎没有计算能力,也几乎没有足够的存储空间来存储其密钥。 因此,我们需要另一个更轻便的解决方案。

One way to do that is to use a cryptographic protocol like the HB Protocol by Hopper and Blum [1]. This protocol makes it difficult for this attacker to extract the key.

一种方法是使用像HopperBlum [1]的HB Protocol这样的加密协议。 该协议使攻击者很难提取密钥。

Photo by Goh Rhy Yan on Unsplash Goh Rhy Yan在 Unsplash上 拍摄的照片

The vanilla HB Protocol that I am going to introduce has other vulnerabilities and should not be used in practice. I just use because it is easy to explain. For real-world security, more secure extensions of this protocol or other secure protocols should be used.

我将要介绍的香草HB协议还有其他漏洞,不应在实践中使用。 我使用它是因为它很容易解释。 为了实现真实世界的安全性,应使用此协议或其他安全协议的更安全的扩展。

HB协议 (HB Protocol)

So, you have a reader R (the lock) and a tag T (your chip). T now wants to prove to R that it possesses the same secret key without revealing it. This is done by R repeatedly challenging T with questions only a tag with the correct secret key can answer. So far, we have seen that the single question “What is your secret key?” is insecure since this reveals too much information already. Instead, in the HB Protocol T is asked to only reveal small portions of the secret one tiny bit at a time, until R can be sure that T has the correct secret key.

因此,您有一个读取器R (锁)和标签T (您的芯片)。 T现在想向R证明它拥有相同的秘密密钥而没有透露它。 这是通过R反复向T提出问题来挑战T的 ,只有具有正确密钥的标签才能回答。 到目前为止,我们已经看到了一个问题:“您的秘密密钥是什么?” 是不安全的,因为这已经暴露了太多信息。 取而代之的是,在HB协议中,要求T一次仅透露一小部分秘密的一小部分,直到R可以确定T具有正确的秘密密钥为止。

Imagine that the secret keys of R and T are in fact both the same s=(1,0,1,0). Now R sends a random binary vector a (e.g. a=(1,0,1,1)) to T and expects T to respond back to it the scalar product b=, which is

想象一下, RT的秘密密钥实际上都是相同的s =(1,0,1,0)。 现在R发出一个随机二进制矢量 (例如,=(1,0,1,1))T,我们期望至响应回到它的标量积B = 这是

in this example. We call this a a challenge. Remember, we deal with bit arithmetic here, so the “+” is, in fact, an XOR. The multiplication is the same as in the real numbers. Or for mathematicians: we calculate in the field GF(2) or ₂, the field with 2 elements.

在这个例子中。 我们称这是一个 挑战 。 记住,我们在这里处理位算术,因此“ +”实际上是XOR。 乘法与实数相同。 或对于数学家:我们在GF(2)或2字段中计算包含2个元素的字段。

XOR is like the normal addition for integers, just that 1+1=0. XOR就像整数的普通加法一样,只是1 + 1 = 0。

R can compute the scalar product itself (it knows a and s) and checks T’s answer. If T’s answer is the same, R can be a bit more confident that T indeed has the same secret key. To increase confidence, this game is repeated several times.

R可以计算标量积本身(它知道as )并检查T的答案。 如果T的答案相同,则R可以更确定T确实具有相同的密钥。 为了增加信心,该游戏重复了几次。

For example, if T does not have the correct key, it would be very unlikely to succeed after a sufficient number of rounds, since a single response would be only correct with probability 0.5. Hence, after 10 rounds, for example, the chance of successful authentication is just 1/1024, less than 0,1%.

例如,如果T没有正确的密钥,那么经过足够多的回合后,成功的可能性将很小,因为单个响应仅以0.5的概率是正确的。 因此,例如,在10轮之后,成功身份验证的机会仅为1/1024, 小于0.1%

This sounds much better, right? T is not revealing its secret in one go now, instead, it gives some information to R by answering the challenges. But sadly, this is also completely insecure. An attacker could still write down the complete communication between R and T and then easily solve a system of linear equations to recover s. This is done in the following way: Imagine the attacker has written down the following for challenge/response pairs:

听起来好多了吧? T并没有立即透露其秘密,相反,​​它通过回答挑战为R提供了一些信息。 但是可悲的是,这也是完全不安全的。 攻击者仍然可以写下RT之间的完整通信,然后轻松地求解线性方程组以恢复s 。 这是通过以下方式完成的:想象攻击者为挑战/响应对写下了以下内容:

The attacker also knows that

攻击者也知道

where A is the matrix containing the aᵢ’s as rows and b the bᵢ’s. In our case:

其中A是包含aᵢ的行和b b b矩阵 。 在我们的情况下:

The system of linear equations the attacker has to solve. 攻击者必须解决的线性方程组。

So solving this system for s yields the secret. This can also easily be done via Gaussian Elimination if s is much larger, i.e. 1024 bits long. The solution is s=(1,0,1,0) by the way. :)

因此,解决该系统的s产生了秘密。 如果s大得多,即1024位长,也可以通过高斯消除轻松完成。 顺便说一下,解是s =(1,0,1,0)。 :)

带有噪声的学习平价问题 (The Learning Parity with Noise Problem)

There is one very small but extremely important tweak to make this secure against our attacker: T just adds some random Bernoulli noise to its responses. Instead of sending back to R, it flips a coin e which is 1 with probability p and 0 otherwise and sends back +e to the reader. In other words, with probability 1-p the tag sends back to R and with probability p it flips the response bit from 0 to 1 or from 1 to 0. We assume that p<0.5.

有一个非常小但非常重要的调整,可以使此方法对付我们的攻击者安全: T只会在响应中添加一些随机的伯努利噪声。 而不是将发送回R,而是将概率为p的硬币e翻转为1 否则翻转为0,然后将 + e发送回阅读器 换句话说,标签以概率1- p发送回R,并以概率p将响应位从0翻转为1或从1翻转为0 我们假设p < 0.5。

This does not prevent the attacker from sniffing the communication between R and T and taking notes, of course, but they have to solve the following problem now:

不会阻止攻击者嗅探RT之间的沟通和记笔记当然,但他们现在必须解决以下问题:

This notation indicates that each equation of the equation system is only correct with probability 1-p. More formally, you can write it as As+e=b, where e is the noise vector with each component (independently) being 1 with probability p and 0 with probability 1-p.

该符号表示方程组的每个方程仅在概率为1-p的情况下才是正确的。 更正式地讲,您可以将其写为As + e = b,其中e是噪声矢量,每个分量(独立地)的概率为p均为1,概率为1-p为0。

Thus, the attacker has to solve a noisy system of equations over GF(2). For a constant error rate p, this problem — the Learning Parity with Noise (LPN) Problem — is conjectured to be infeasible to solve for large enough length of the secret key. This is also true, if the attacker can get arbitrarily many equations.

因此,攻击者必须解决GF(2)上一个嘈杂的方程组。 对于恒定的错误率p,此问题(带有噪声的学习奇偶性(LPN)问题)被认为无法解决足够长的密钥。 如果攻击者可以任意获得许多方程式,这也是正确的。

Even with these errors added, R can do its job of determining whether T knows s or not. If T has the correct s, a fraction of about 1-p responses will be correct. That means if p=0.25 the HB Protocol runs for 1000 iterations, T should give around 750 correct responses.

即使添加了这些错误, R仍可以确定T是否知道s 。 如果T具有正确的s,则大约1- p响应的一部分将是正确的 这意味着如果p = 0.25 HB协议运行1000次迭代,则T应该给出750个正确的响应

If T does not have the correct s, it will give a fraction of around 0.5 correct answers, i.e. 500 out of 1000 rounds protocol run. This allows R to decide whether T has the correct secret or not and this protocol still makes sense for our use case.

如果T没有正确的s ,它将给出大约0.5个正确答案的分数,即每1000轮协议运行中有500个正确答案。 这使R可以决定T是否具有正确的机密,并且该协议对于我们的用例仍然有意义。

通过机器学习解决LPN (Solving LPN via Machine Learning)

Let’s get to the fun part now. We have established that solving the LPN problem means, given a random binary matrix A and a binary vector b=As+e, recovering s.

现在让我们开始有趣的部分。 我们已经确定,解决LPN问题意味着,在给定随机二进制矩阵A和二进制向量b = As + e的情况下 ,恢复s。

The important observation: We can treat each row aᵢ of matrix A now as a sample and the corresponding value bᵢ=+eᵢ in the vector b as the label.

重要的观察结果:我们现在可以将矩阵A的每一行aᵢ当作样本,并将向量b中的对应值bᵢ= +eᵢ视为标签。

Using an example with a secret length of 4 and six captured communications, we can see that each row of matrix A consists of four features and the entry in b is the corresponding label. Our dataset has a size of six. 使用一个秘密长度为4且捕获了六个通信的示例,我们可以看到矩阵A的每一行都包含四个特征,并且b中的条目是相应的标签。 我们的数据集大小为6。

As found in normal datasets used in machine learning, a label bᵢ actually resembles the scalar product of the feature vector aᵢ and a fixed secret vector s (some ground truth), but with an error term added. But how can we get the secret s when we throw a machine learning algorithm for predicting the labels on it?

正如在机器学习中使用的正常数据集中所发现的,标签bᵢ实际上类似于特征向量aᵢ与固定秘密向量s的标量积(某些基本事实),但是添加了一个误差项。 但是,我们如何能得到秘密S当我们抛出一个机器学习算法,它预测的标签

Well, if we could learn the problem reasonably well, we could generate good predictions for the labels (the scalar products; the ground truth) for each feature vector aᵢ we like. If we throw in the vector a=(1,0,0,0), we would then receive a good guess for

那么,如果我们可以学习的问题相当好,我们可以生成标签良好的预测(标产品;地面实况)对于每个特征向量aᵢ我们喜欢。 如果我们将向量a =(1,0,0,0)丢进去,那么我们将得到一个很好的猜测

the first bit of s! Do the same with the vectors (0, 1, 0, 0), (0, 0, 1, 0) and (0, 0, 0, 1) and we have all the bits of the secret key.

的第一位 ! 对向量(0、1、0、0),(0、0、1、0)和(0、0、0、1)进行相同操作,我们就拥有了密钥的所有位。

Thus, we can solve the LPN problem using machine learning.

因此,我们可以使用机器学习来解决LPN问题。

备注 (Remarks)

The LPN Problem is a very versatile problem that you can also use to build encryption, identity-based encryption, user authentication, oblivious transfer, message authentication codes, and probably more constructions. Also, unlike the factorization problem, the LPN problem cannot easily be solved using quantum computers. Paired together with its light-weightiness it is a good candidate for building post-quantum secure algorithms. So, no worries, if RSA, which is kind of based on factoring large numbers, dies in the presence of quantum computers. ;)

LPN问题是一个非常通用的问题,您还可以使用它来构建加密,基于身份的加密,用户身份验证,遗忘的传输,消息身份验证代码以及可能的更多构造。 而且,与因式分解问题不同,使用量子计算机无法轻松解决LPN问题。 加上轻巧的特性,它是构建后量子安全算法的理想选择。 因此,不用担心,如果基于量子分解的RSA在量子计算机的存在下死了。 ;)

For more information and a better, mathematical definition of the LPN problem, please refer to my dissertation [2].

有关LPN问题的更多信息和更好的数学定义,请参阅我的论文[2]。

实验 (Experiments)

Let us first define an LPN oracle, i.e. a class that we can feed with a secret key and an error level p upon instantiation, which gives us as many samples as we want.

让我们首先定义一个LPN oracle, 一个我们可以在实例化时提供一个秘密密钥和一个错误级别p的类,它为我们提供了所需的任意数量的样本。

This can easily be done using the following code:

使用以下代码可以轻松完成此操作:

import numpy as npclass LPNOracle:def __init__(self, secret, error_rate):
self.secret = secret
self.dimension = len(secret)
self.error_rate = error_rate def sample(self, n_amount):# Create random matrix.A = np.random.randint(0, 2, size=(n_amount, self.dimension)) # Add Bernoulli errors.e = np.random.binomial(1, self.error_rate, n_amount) # Compute the labels.b = np.mod(A @ self.secret + e, 2) return A, b

We can now instantiate this an oracle with a random secret of length 16 and p=0.125.

现在,我们可以实例化一个长度为16且p = 0.125的随机秘密的oracle。

p = 0.125
dim = 16
s = np.random.randint(0, 2, dim)
lpn = LPNOracle(s, p)

We can now sample from the lpn :

我们现在可以从lpn

lpn.sample(3)my output: (array([[1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1],
[1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
[1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0]]),
array([1, 1, 1], dtype=int32))

Here we have sampled 3 data points. Now, let us try to find s using a Decision Tree. Why? Just because it’s fast and Logistic Regression and Bernoulli Naive Bayes did not work for me.

在这里,我们采样了3个数据点。 现在,让我们尽力去找利用决策树 。 为什么? 仅仅因为它很快并且Logistic回归和Bernoulli Naive Bayes都不适合我。

from sklearn.tree import DecisionTreeClassifierdt = DecisionTreeClassifier()# Get 100000 samples.A, b = lpn.sample(100000)# Fit the tree.dt.fit(A, b)# Predict all canonical unit vectors (1, 0, 0, ..., 0), (0, 1, 0 ,0, ..., 0), ..., (0, 0, ..., 0, 1).s_candidate = dt.predict(np.eye(dim))# Check if the candidate solution is correct.if np.mod(A @ s_candidate + b, 2).sum() < 14000:
print(s_candidate, s)else:
print('Wrong candidate. Try again!')

The learning algorithm might fail to capture the ground truth and learn another function. In this case, the so-called Hamming weight of the vector is quite high (around 50000 for our vector of length 100000). This corresponds to the case where the tag T has the wrong key and can answer about half of the challenges correctly. If s_candidate = s, the Hamming weight will be around 0.125 * 100000 = 12500.

学习算法可能无法捕获基本事实并学习其他功能。 在这种情况下,向量的所谓汉明权重非常高(对于长度为100000的向量,大约为50000) 这对应于标签T具有错误密钥并且可以正确回答大约一半挑战的情况。 如果s _candidate = s,则汉明权重将围绕0.125 * 100000 = 12500。

Having a threshold of 14000 in this example is a good tradeoff between recognizing the correct secret and not outputting a wrong candidate as the solution. You can find how to get this bound in [2, page 23].

在此示例中,将阈值设置为14000是在识别正确的机密与不输出错误的候选者作为解决方案之间的良好折衷。 您可以在[2,第23页]中找到如何获得此约束。

结论 (Conclusion)

We have defined the LPN problem and seen how it arises when trying to break the cryptographic HB Protocol. Then we have solved a small instance using a simple Decision Tree.

我们已经定义了LPN问题,并看到了在尝试破坏密码HB协议时它是如何产生的。 然后,我们使用简单的决策树解决了一个小实例。

But the journey just starts here: We can use other/better algorithms (deep learning, anyone?) or clever tricks to

但是旅程只是从这里开始:我们可以使用其他/更好的算法(深度学习,有人吗?)或巧妙的技巧来

  • get higher success rates,

    获得更高的成功率,
  • using fewer samples and

    使用更少的样本
  • being able to solve problems with higher dimensions.

    能够解决更大尺寸的问题。

For a list and explanations of non-machine learning algorithms to solve LPN, check out my dissertation [2]. Also, if you want fame, try to solve an instance with a secret length of 512 and p=0.125, for example. This LPN instance is currently unbroken and used for some real-world cryptosystems. Good luck! ;)

有关解决LPN的非机器学习算法的列表和说明,请参见我的论文[2]。 另外,如果要成名,请尝试解决一个秘密长度为512且p = 0.125的实例。 该LPN实例当前未中断,并用于某些实际的密码系统。 祝好运! ;)

I hope that I could make you interested in the overlap of cryptography and machine learning!

希望我能使您对密码学和机器学习的重叠感兴趣!

Thanks for reading!

谢谢阅读!

If you have any questions, write me on LinkedIn!

如有任何疑问,请在 LinkedIn上 给我写信

Also, check out my other articles on graspable machine learning topics!

另外,请查看其他有关可掌握的机器学习主题的文章!

翻译自: https://towardsdatascience.com/where-machine-learning-meets-cryptography-b4a23ef54c9e

美团机器学习实践 密码

你可能感兴趣的:(机器学习,python,java,人工智能,深度学习)