How to Search on Securely Encrypted Database Fields


This post was originally published on the ParagonIE blog and reposted here with their permission.




We [ParagonIE] get asked the same question a lot (or some remix of it).


This question shows up from time to time in open source encryption libraries’ bug trackers. This was one of the “weird problems” covered in my talk at B-Sides Orlando (titled Building Defensible Solutions to Weird Problems), and we’ve previously dedicated a small section to it in one of our white papers.


You know how to search database fields, but the question is, How do we securely encrypt database fields but still use these fields in search queries?


Our secure solution is rather straightforward, but the path between most teams asking that question and discovering our straightforward solution is fraught with peril: bad designs, academic research projects, misleading marketing, and poor threat modeling.


If you’re in a hurry, feel free to skip ahead to the solution.


Towards Searchable Encryption

Let’s start with a simple scenario (which might be particularly relevant for a lot of local government or health care applications):


  • You are building a new system that needs to collect social security numbers (SSNs) from its users.
  • Regulations and common sense both dictate that users’ SSNs should be encrypted at rest.
  • Staff members will need to be able to look up users’ accounts, given their SSN.

Let’s first explore the flaws with the obvious answers to this problem.


Insecure (or otherwise ill-advised) Answers

Non-randomized Encryption

The most obvious answer to most teams (particularly teams that don’t have security or cryptography experts) would be to do something like this:


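<?php
// Class name and constructor signature are assumed; the opening of this listing was lost.
class InsecureExampleOne
{
    /** @var \PDO */
    protected $db;

    /** @var string */
    protected $key;

    public function __construct(\PDO $db, string $key)
    {
        $this->db = $db;
        $this->key = $key;
    }

    public function searchByValue(string $query): array
    {
        // Works only because the same plaintext always encrypts to the same ciphertext.
        $stmt = $this->db->prepare('SELECT * FROM table WHERE column = ?');
        $stmt->execute([
            $this->insecureEncryptDoNotUse($query)
        ]);
        return $stmt->fetchAll(\PDO::FETCH_ASSOC);
    }

    protected function insecureEncryptDoNotUse(string $plaintext): string
    {
        // AES-128-ECB is deterministic and encrypts each 16-byte block independently.
        return \bin2hex(
            \openssl_encrypt(
                $plaintext,
                'aes-128-ecb',
                $this->key,
                OPENSSL_RAW_DATA | OPENSSL_ZERO_PADDING
            )
        );
    }
}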

In the above snippet, the same plaintext always produces the same ciphertext when encrypted with the same key. But more concerning with ECB mode is that every 16-byte chunk is encrypted separately, which can have some extremely unfortunate consequences.


Formally, these constructions are not semantically secure: If you encrypt a large message, you will see blocks repeat in the ciphertext.


In order to be secure, encryption must be indistinguishable from random noise to anyone that does not hold the decryption key. Insecure modes include ECB mode and CBC mode with a static (or empty) IV.


You want non-deterministic encryption, which means each message uses a unique nonce or initialization vector that never repeats for a given key.

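To make the determinism problem concrete, here is a minimal sketch (not from the original post) that encrypts with AES-128-ECB through PHP's OpenSSL extension. The helper name ecbEncrypt and the sample values are made up for illustration.

```php
<?php
// Demonstration only: never use ECB mode for real data.
$key = random_bytes(16);

function ecbEncrypt(string $plaintext, string $key): string
{
    // Raw binary output with OpenSSL's default PKCS#7 padding.
    return openssl_encrypt($plaintext, 'aes-128-ecb', $key, OPENSSL_RAW_DATA);
}

// Same plaintext, same key: the ciphertext is identical every time.
$a = ecbEncrypt('123-45-6789', $key);
$b = ecbEncrypt('123-45-6789', $key);
var_dump($a === $b); // bool(true)

// Two identical 16-byte plaintext blocks yield two identical ciphertext blocks.
$c = ecbEncrypt(str_repeat('A', 32), $key);
var_dump(substr($c, 0, 16) === substr($c, 16, 16)); // bool(true)
```

Anyone who can read the ciphertext column can therefore tell which rows hold equal values and spot repeated blocks, without ever touching the key.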

Experimental Academic Designs

There is a lot of academic research going into such topics as homomorphic, order-revealing, and order-preserving encryption techniques.


As interesting as this work is, the current designs are nowhere near secure enough to use in a production environment.


For example, order-revealing encryption leaks enough data to infer the plaintext.


Homomorphic encryption schemes often repackage vulnerabilities (practical chosen-ciphertext attacks) as features.

  • Unpadded RSA is homomorphic with respect to multiplication.

    • If you multiply a ciphertext by the encryption of an integer k (that is, by k^e mod n), the plaintext you get after decryption equals the original message multiplied by k. There are several possible attacks against unpadded RSA, which is why RSA in the real world uses padding (although often an insecure padding mode).

  • AES in Counter Mode is homomorphic with respect to XOR.

    • This is why nonce reuse defeats the confidentiality of your message in CTR mode (and, in general, in any stream cipher that is not nonce-misuse resistant).
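Both properties can be sketched in a few lines. The RSA numbers below are a toy textbook key (p = 61, q = 53), and modPow, the message value, and the payment string are all assumptions made for illustration; nothing here comes from a real protocol.

```php
<?php
// Square-and-multiply modular exponentiation; safe here because n is tiny.
function modPow(int $base, int $exp, int $mod): int
{
    $result = 1;
    $base %= $mod;
    while ($exp > 0) {
        if ($exp & 1) {
            $result = ($result * $base) % $mod;
        }
        $base = ($base * $base) % $mod;
        $exp >>= 1;
    }
    return $result;
}

// Toy unpadded RSA key: n = 61 * 53, e * d = 1 mod 3120.
$n = 3233;
$e = 17;
$d = 2753;

$message = 7;
$ciphertext = modPow($message, $e, $n);
// The attacker multiplies by the encryption of 2 without knowing $d.
$forged = ($ciphertext * modPow(2, $e, $n)) % $n;
$recovered = modPow($forged, $d, $n);
var_dump($recovered); // int(14), which is 2 * 7

// AES-CTR: XORing the ciphertext XORs the plaintext in the same position.
$key = random_bytes(16);
$iv = random_bytes(16);
$plaintext = 'pay Bob $100';
$ct = openssl_encrypt($plaintext, 'aes-128-ctr', $key, OPENSSL_RAW_DATA, $iv);

// Assume the attacker knows the message format and where the digit sits.
$pos = strpos($plaintext, '1');
$ct[$pos] = $ct[$pos] ^ ('1' ^ '9');

$tampered = openssl_decrypt($ct, 'aes-128-ctr', $key, OPENSSL_RAW_DATA, $iv);
var_dump($tampered); // string(12) "pay Bob $900"
```

In both cases the attacker changes the decrypted value in a predictable way without ever learning the key, and nothing in the scheme lets the application detect it.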

As we’ve covered in a previous blog post, when it comes to real-world cryptography, confidentiality without integrity is the same as no confidentiality. What happens if an attacker gains access to the database, alters ciphertexts, and studies the behavior of the application upon decryption?


There’s potential for ongoing cryptography research to one day produce an innovative encryption design that doesn’t undo decades of research into safe cryptography primitives and cryptographic protocol designs. However, we’re not there yet, and you don’t need to invest into a needlessly complicated research prototype to solve the problem.


Dishonorable Mention: Decrypt Every Row

I don’t expect most engineers to arrive at this solution without a trace of sarcasm. The bad idea here is, because you need secure encryption (see below), your only recourse is to query every ciphertext in the database and then iterate through them, decrypting them one-by-one and performing your search operation in the application code.


If you go down this route, you will open your application to denial of service attacks. It will be slow for your legitimate users. This is a cynic’s answer, and you can do much better than that, as we’ll demonstrate below.


Secure Searchable Encryption Made Easy

Let’s start by avoiding all the problems outlined in the insecure/ill-advised section in one fell swoop: All ciphertexts will be the result of an authenticated encryption scheme, preferably with large nonces (generated from a secure random number generator).


With an authenticated encryption scheme, ciphertexts are non-deterministic (same message and key, but different nonce, yields a different ciphertext) and protected by an authentication tag. Some suitable options include: XSalsa20-Poly1305, XChaCha20-Poly1305, and (assuming it’s not broken before CAESAR concludes) NORX64-4-1. If you’re using NaCl or libsodium, you can just use crypto_secretbox here.

Consequently, our ciphertexts are indistinguishable from random noise, and protected against chosen-ciphertext attacks. That’s how secure, boring encryption ought to be.

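As a rough sketch of what that looks like with PHP's sodium extension (bundled since PHP 7.2), the two helpers below wrap crypto_secretbox. The names encryptField and decryptField, and the choice to store nonce and ciphertext together as hex, are our assumptions for illustration rather than a prescribed format.

```php
<?php
// Sketch only: requires the sodium extension.

function encryptField(string $plaintext, string $key): string
{
    // A fresh random 24-byte nonce for every message.
    $nonce = random_bytes(SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);
    $ciphertext = sodium_crypto_secretbox($plaintext, $nonce, $key);
    // Store the nonce next to the ciphertext; it is not secret.
    return sodium_bin2hex($nonce . $ciphertext);
}

function decryptField(string $stored, string $key): string
{
    $raw = sodium_hex2bin($stored);
    $nonce = substr($raw, 0, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);
    $ciphertext = substr($raw, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);
    $plaintext = sodium_crypto_secretbox_open($ciphertext, $nonce, $key);
    if ($plaintext === false) {
        // The authentication tag did not verify.
        throw new RuntimeException('Ciphertext was tampered with or the key is wrong.');
    }
    return $plaintext;
}

$key = sodium_crypto_secretbox_keygen();
$one = encryptField('123-45-6789', $key);
$two = encryptField('123-45-6789', $key);

var_dump($one === $two);              // bool(false): different nonces
var_dump(decryptField($one, $key));   // string(11) "123-45-6789"
```

Because the two stored values differ even though the SSN is the same, a WHERE clause on the encrypted column can no longer find matching rows.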

However, this presents an immediate challenge: We can’t just encrypt arbitrary messages and query the database for matching ciphertexts. Fortunately, there is a clever workaround.


Important: Threat Model Your Usage of Encryption

Before you begin, make sure that encryption is actually making your data safer. It is important to emphasize that “encrypted storage” isn’t the solution to securing a CRUD app that’s vulnerable to SQL injection. Solving the actual problem (i.e. preventing the SQL injection) is the only way to go.


If encryption is a suitable security control to implement, this implies that the cryptographic keys used to encrypt/decrypt data are not accessible to the database software. In most cases, it makes sense to keep the application server and database server on separate hardware.


Implementing Literal Search of Encrypted Data

Possible use-case: Storing social security numbers, but still being able to query them.


In order to store encrypted information and still use the plaintext in SELECT queries, we’re going to follow a strategy we call blind indexing. The general idea is to store a keyed hash (e.g. HMAC) of the plaintext in a separate column. It is important that the blind index key be distinct from the encryption key and unknown to the database server.


For very sensitive information, instead of a simple HMAC, you will want to use a key-stretching algorithm (PBKDF2-SHA256, scrypt, Argon2) with the key acting as a static salt, to slow down attempts at enumeration. We aren’t worried about offline brute-force attacks in either case, unless an attacker can obtain the key (which must not be stored in the database).
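The two flavors of blind index might be sketched as follows. The function names blindIndexFast and blindIndexSlow are hypothetical, and the iteration count is an arbitrary placeholder that you would tune for your own hardware; PBKDF2-SHA256 is used here only because it ships with PHP's hash extension.

```php
<?php
// Fast variant: a plain HMAC-SHA256 of the plaintext under the index key.
function blindIndexFast(string $plaintext, string $indexKey): string
{
    return hash_hmac('sha256', $plaintext, $indexKey);
}

// Slow variant: key stretching, with the index key acting as a static salt.
// The iteration count is a placeholder, not a recommendation.
function blindIndexSlow(string $plaintext, string $indexKey, int $iterations = 100000): string
{
    // 64 hex characters of output, i.e. 32 bytes.
    return hash_pbkdf2('sha256', $plaintext, $indexKey, $iterations, 64, false);
}

// The index key must be separate from the encryption key and kept off the database server.
$indexKey = random_bytes(32);

$stored = blindIndexSlow('123-45-6789', $indexKey);
$lookup = blindIndexSlow('123-45-6789', $indexKey);
var_dump(hash_equals($stored, $lookup)); // bool(true): same input, same index
```

Unlike the ciphertext, the blind index is deterministic on purpose: that is what lets the database match it with an ordinary equality comparison.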

So if your table schema looks like this (in PostgreSQL flavor):


CREATE TABLE humans (

    humanid BIGSERIAL PRIMARY KEY,
    first_name TEXT,
    last_name TEXT,
    ssn TEXT, /* encrypted */
    ssn_bidx TEXT /* blind index */
);
CREATE INDEX ON humans (ssn_bidx);

You would store the encrypted value in humans.ssn. A blind index of the plaintext SSN would go into humans.ssn_bidx. A naive implementation might look like this:


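function findHumanBySSN(PDO $db, string $ssn, string $indexKey): array
{
    // Assumed opening (lost in extraction): the blind index is an HMAC-SHA256 of the SSN.
    $index = hash_hmac('sha256', $ssn, $indexKey);
    $stmt = $db->prepare('SELECT * FROM humans WHERE ssn_bidx = ?');
    $stmt->execute([$index]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}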

A more comprehensive proof-of-concept is included in the supplemental material for my B-Sides Orlando 2017 talk. It’s released under the Creative Commons CC0 license, which for most people means the same thing as “public domain”.


Security Analysis and Limitations

Depending on your exact threat model, this solution leaves two questions that must be answered before it can be adopted:


  1. Is it safe to use, or does it leak data like a sieve?
  2. What are the limitations on its usefulness? (This one is sort of answered already.)

Given our example above, assuming your encryption key and your blind index key are separate, both keys are stored in the webserver, and the database server doesn’t have any way of obtaining these keys, then any attacker that only compromises the database server (but not the web server) will only be able to learn if several rows share a social security number, but not what the shared SSN is. This duplicate entry leak is necessary in order for indexing to be possible, which in turn allows fast SELECT queries from a user-provided value.


Furthermore, if an attacker can both observe or change plaintexts as a normal user of the application and observe the blind indices stored in the database, they can leverage this into a chosen-plaintext attack: they iterate over every possible value as a user and correlate each one with the resulting blind index value. This is more practical in the HMAC scenario than in, for example, the Argon2 scenario. For high-entropy or low-sensitivity values (not SSNs), the physics of brute force can be on our side.

A much more practical attack for such a criminal would be to substitute values from one row to another then access the application normally, which will reveal the plaintext unless a distinct per-row key was employed (e.g. hash_hmac('sha256', $rowID, $masterKey, true) could even be an effective mitigation here, although others would be preferable). The best defense here is to use an AEAD mode (passing the primary key as additional associated data) so that the ciphertexts are tied to a particular database row. (This will not prevent attackers from deleting data, which is a much bigger challenge.)

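A minimal sketch of the AEAD defense, assuming PHP's sodium extension and XChaCha20-Poly1305. The helper names and the 'humans.ssn:' prefix used as additional data are our own illustrative choices; any unambiguous encoding of the table, column, and primary key would serve the same purpose.

```php
<?php
// Sketch only: requires the sodium extension (bundled with PHP 7.2+).

function encryptForRow(string $plaintext, int $rowId, string $key): string
{
    $nonce = random_bytes(SODIUM_CRYPTO_AEAD_XCHACHA20POLY1305_IETF_NPUBBYTES);
    // Additional data is authenticated but not encrypted; here it names the column and row.
    $ad = 'humans.ssn:' . $rowId;
    $ciphertext = sodium_crypto_aead_xchacha20poly1305_ietf_encrypt($plaintext, $ad, $nonce, $key);
    return sodium_bin2hex($nonce . $ciphertext);
}

function decryptForRow(string $stored, int $rowId, string $key): string
{
    $raw = sodium_hex2bin($stored);
    $nonceLen = SODIUM_CRYPTO_AEAD_XCHACHA20POLY1305_IETF_NPUBBYTES;
    $plaintext = sodium_crypto_aead_xchacha20poly1305_ietf_decrypt(
        substr($raw, $nonceLen),
        'humans.ssn:' . $rowId,
        substr($raw, 0, $nonceLen),
        $key
    );
    if ($plaintext === false) {
        throw new RuntimeException('Decryption failed: wrong row, wrong key, or tampering.');
    }
    return $plaintext;
}

$key = sodium_crypto_aead_xchacha20poly1305_ietf_keygen();
$row1 = encryptForRow('123-45-6789', 1, $key);

var_dump(decryptForRow($row1, 1, $key)); // string(11) "123-45-6789"

// Copying row 1's ciphertext into row 2 no longer decrypts.
try {
    decryptForRow($row1, 2, $key);
} catch (RuntimeException $ex) {
    echo $ex->getMessage(), PHP_EOL;
}
```

The row ID never needs to be stored inside the ciphertext; the application supplies it again at decryption time, so a swapped value fails authentication instead of revealing someone else's plaintext.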

Compared to the amount of information leaked by other solutions, most applications’ threat models should find this to be an acceptable trade-off. As long as you’re using authenticated encryption for encryption, and either HMAC (for blind indexing non-sensitive data) or a password hashing algorithm (for blind indexing sensitive data), it’s easy to reason about the security of your application.


However, it does have one very serious limitation: It only works for exact matches. If two strings differ in a meaningless way but will always produce a different cryptographic hash, then searching for one will never yield the other. If you need to do more advanced queries, but still want to keep your decryption keys and plaintext values out of the hands of the database server, we’re going to have to get creative.


It is also worth noting that, while HMAC/Argon2 can prevent attackers that do not possess the key from learning the plaintext values of what is stored in the database, it might reveal metadata (e.g. two seemingly-unrelated people share a street address) about the real world.


Implementing Fuzzier Searching for Encrypted Data

Possible use-case: Encrypting people’s legal names, and being able to search with only partial matches.

Let’s build on the previous section, where we built a blind index that allows you to query the database for exact matches.


This time, instead of adding columns to the existing table, we’re going to store extra index values into a join table.


CREATE TABLE humans (
    humanid BIGSERIAL PRIMARY KEY,
    first_name TEXT, /* encrypted */
    last_name TEXT, /* encrypted */
    ssn TEXT /* encrypted */
);
CREATE TABLE humans_filters (
    filterid BIGSERIAL PRIMARY KEY,
    humanid BIGINT REFERENCES humans (humanid),
    filter_label TEXT,
    filter_value TEXT
);
/* Creates an index on the pair. If your SQL expert overrules this, feel free to omit it. */
CREATE INDEX ON humans_filters (filter_label, filter_value);

The reason for this change is to normalize our data structures. You can get by with just adding columns to the existing table, but it’s likely to get messy.


The next change is that we’re going to store a separate, distinct blind index per column for every different kind of query we need (each with its own key). For example:


  • Need a case-insensitive lookup that ignores whitespace?

    • Store a blind index of preg_replace('/[^a-z]/', '', strtolower($value)). (Note that this pattern also strips digits and punctuation, not only whitespace.)

  • Need to query the first letter of their last name?

    • Store a blind index of strtolower(mb_substr($lastName, 0, 1, $encoding)).

  • Need to match “begins with this letter, ends with that letter”?

    • Store a blind index of strtolower($string[0] . $string[-1]).

  • Need to query the first three letters of their last name and the first letter of their first name?

    • You guessed it! Build another index based on partial data.
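Put together, the transformations in the list above might be sketched like this. The function names, the filter labels, and the use of plain HMAC-SHA256 are illustrative assumptions (the mbstring extension is also assumed); each label gets its own random key, and each resulting pair would become one row in humans_filters.

```php
<?php
// Sketch only: assumes the mbstring extension and one random 32-byte key per index label.

function blindIndex(string $value, string $key): string
{
    return hash_hmac('sha256', $value, $key);
}

/**
 * Build the (filter_label => filter_value) pairs for one person.
 * $keys must hold a distinct key for every label used below.
 */
function buildNameFilters(string $firstName, string $lastName, array $keys): array
{
    $first = mb_strtolower($firstName, 'UTF-8');
    $last = mb_strtolower($lastName, 'UTF-8');

    $inputs = [
        // Case-insensitive, drops everything that is not a-z.
        'last_name_normalized' => preg_replace('/[^a-z]/', '', strtolower($lastName)),
        // First letter of the last name.
        'last_name_initial' => mb_substr($last, 0, 1, 'UTF-8'),
        // First three letters of the last name plus first letter of the first name.
        'last3_first1' => mb_substr($last, 0, 3, 'UTF-8') . mb_substr($first, 0, 1, 'UTF-8'),
    ];

    $filters = [];
    foreach ($inputs as $label => $input) {
        $filters[$label] = blindIndex($input, $keys[$label]);
    }
    return $filters;
}

$keys = [
    'last_name_normalized' => random_bytes(32),
    'last_name_initial' => random_bytes(32),
    'last3_first1' => random_bytes(32),
];

$stored = buildNameFilters('Alice', 'Smith', $keys);
$query = buildNameFilters('ALICE', 'smith', $keys);

var_dump($stored['last3_first1'] === $query['last3_first1']); // bool(true)
```

A search then computes the single filter it needs from the user's input and looks it up with something like SELECT humanid FROM humans_filters WHERE filter_label = ? AND filter_value = ?, never sending the plaintext to the database.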

Every index needs to have a distinct key, and great pains should be taken to prevent blind indices of subsets of the plaintext from leaking real plaintext values to a criminal with a knack for crossword puzzles. Only create indexes for serious business needs, and log access to these parts of your application aggressively.


Trading Memory for Time

Thus far, all of the design propositions have been in favor of allowing developers to write carefully considered SELECT queries, while minimizing the number of times the decryption subroutine is invoked. Generally, that is where the train stops and most people’s goals have been met.

However, there are situations where a mild performance hit in search queries is acceptable if it means saving a lot of disk space.


The trick here is simple: Truncate your blind indexes to e.g. 16, 32, or 64 bits, and treat them as a Bloom filter:


  • If the blind indices involved in the query match a given row, the data is probably a match.

    • Your application code will need to perform the decryption for each candidate row and then only serve the actual matches.

  • If the blind indices involved in the query do not match a given row, then the data is definitely not a match.

It may also be worth converting these values from a string to an integer, if your database server will end up storing it more efficiently.

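One way this could look, as a sketch under stated assumptions: the function names are hypothetical, 64-bit PHP is assumed so that an unsigned 32-bit value fits in an int, and $decrypt stands for whatever authenticated decryption routine your application already uses.

```php
<?php
// Sketch only: assumes 64-bit PHP.

function truncatedBlindIndex(string $plaintext, string $indexKey, int $bits = 32): int
{
    if ($bits < 1 || $bits > 32) {
        throw new InvalidArgumentException('This sketch supports 1 to 32 bits.');
    }
    $raw = hash_hmac('sha256', $plaintext, $indexKey, true);
    // Read the first 4 bytes as an unsigned big-endian integer, then keep the top $bits bits.
    $value = unpack('N', substr($raw, 0, 4))[1];
    return $value >> (32 - $bits);
}

/**
 * Decrypt every candidate row returned by the index lookup and keep only the real matches.
 * Each row is expected to carry its ciphertext under the 'ssn' key.
 */
function confirmMatches(array $candidateRows, string $needle, callable $decrypt): array
{
    $matches = [];
    foreach ($candidateRows as $row) {
        if (hash_equals($needle, $decrypt($row['ssn']))) {
            $matches[] = $row;
        }
    }
    return $matches;
}

$indexKey = random_bytes(32);
$bidx = truncatedBlindIndex('123-45-6789', $indexKey, 16);
var_dump($bidx >= 0 && $bidx < 65536); // bool(true)
```

With a 16-bit index, unrelated values will collide regularly, so the database returns a small set of candidates and confirmMatches discards the false positives after decryption. The shorter the index, the less it reveals and the more rows the application has to decrypt per query.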

Conclusion

I hope I’ve adequately demonstrated that it is not only possible to build a system that uses secure encryption while allowing fast queries (with minimal information leakage against very privileged attackers), but that it’s possible to build such a system simply, out of the components provided by modern cryptography libraries with very little glue.


If you’re interested in implementing encrypted database storage into your software, we’d love to provide you and your company with our consulting services. Contact ParagonIE if you’re interested.


Source: https://www.sitepoint.com/how-to-search-on-securely-encrypted-database-fields/

