How to Preserve Privacy of Text Representations in NLP

Problem overview

Recently, we have been witnessing numerous breakthroughs in Natural Language Processing (NLP), owing to the evolution of Deep Learning (DL). The successes started with word2vec, a distributed word representation that projects discrete words into a vector space. Such mappings have revolutionized how syntactic and semantic relations among words are understood and manipulated. One famous example is the analogy arithmetic that word2vec embeddings support: vec(king) - vec(man) + vec(woman) ≈ vec(queen).

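With a pretrained model, this analogy can be reproduced in a few lines (a sketch assuming gensim is installed and its downloadable word2vec vectors are available; the model name is one common choice, not something used in this work):

    # Minimal sketch of the king - man + woman ≈ queen analogy.
    import gensim.downloader as api

    model = api.load("word2vec-google-news-300")  # pretrained word2vec vectors
    # Find the word whose vector is closest to vec(king) - vec(man) + vec(woman)
    print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # Expected to rank 'queen' first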

Powered by this technique, a myriad of NLP tasks have achieved human parity and are now widely deployed in commercial systems [2,3].


At the core of these accomplishments is representation learning, which extracts the information required by the task, such as semantics, sentiment, and intent. However, because of over-parameterization, DL models also memorize certain unnecessary but sensitive attributes, such as gender, age, and location.


This private information can be exploited by malicious parties in different settings. Firstly, cloud AI services have become widespread: users can easily annotate their unlabelled datasets via platforms such as Microsoft Cognitive Services and the Google Cloud API. However, if eavesdroppers intercept the intermediate representation of a user's input to a cloud AI service, they can perform reverse engineering to recover the original text. Because of such privacy concerns, users are unwilling to upload their raw data to servers; instead, they can transmit only the extracted representations. Nevertheless, the input representation after the embedding layer, or an intermediate hidden representation, may still carry sensitive information that can be exploited adversarially. It has been demonstrated that an attacker can recover private variables with higher-than-chance accuracy using only the hidden representation [4,5]. Such an attack could occur whenever end-users send their learned representations to the cloud for grammar correction, translation, or text analysis, as shown in Fig. 1.


Fig. 1: Inference attack on representations.


How to preserve representation privacy in NLP

More recently, Li et al. [4] and Coavoux et al. [5] proposed to train deep models with adversarial learning. However, both works provide only empirical privacy, without any formal privacy guarantee. To address this issue and protect privacy against an untrusted server and eavesdroppers, we take a different approach and utilize Local Differential Privacy (LDP), defined as follows.

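Formally, a randomized mechanism M satisfies ε-LDP if, for any two inputs v and v' and any possible output y,

    \Pr[\mathcal{M}(v) = y] \;\le\; e^{\epsilon} \cdot \Pr[\mathcal{M}(v') = y].

Intuitively, the smaller ε is, the less the output distribution may depend on any single input, so an eavesdropper or untrusted server learns little about the raw value.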

Compared with the centralized DP (CDP) adopted by Google [1], LDP offers a stronger level of protection. As illustrated in Fig. 2, in DL with CDP, a trusted server owns the data of all users [1] and applies a CDP algorithm before answering queries from end-users. This approach poses a privacy threat to data owners when the server is untrusted. By contrast, in DL with LDP, data owners are willing to contribute their data for social good but do not fully trust the server, so the data must be perturbed locally before being released to the server for further learning.


Fig. 2: Deep Learning with CDP and LDP.


Basically, to achieve privacy protection, LDP employs a protocol named Unary Encoding (UE), which consists of two steps (a minimal code sketch follows the list):

  1. Encoding: the input is encoded as a d-bit vector in which exactly one bit is 1 and all remaining bits are 0.
  2. Perturbing: the 1-bit is flipped with probability (1-p), i.e., kept with probability p, while each 0-bit is preserved with probability (1-q), i.e., flipped to 1 with probability q.
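
To make the two steps concrete, here is a minimal numpy sketch of UE with generic p and q (the function names and the toy values are illustrative only):

    # Minimal sketch of Unary Encoding (UE): encode, then perturb.
    import numpy as np

    def ue_encode(value: int, d: int) -> np.ndarray:
        """Encode a categorical value as a d-bit one-hot vector."""
        bits = np.zeros(d, dtype=np.int8)
        bits[value] = 1
        return bits

    def ue_perturb(bits: np.ndarray, p: float, q: float) -> np.ndarray:
        """Keep each 1-bit with probability p; flip each 0-bit to 1 with probability q."""
        rnd = np.random.rand(bits.size)
        return np.where(bits == 1, rnd < p, rnd < q).astype(np.int8)

    noisy = ue_perturb(ue_encode(3, d=8), p=0.75, q=0.25)
    print(noisy)  # e.g. [0 1 0 1 0 0 1 0]; bit 3 survives with probability 0.75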

Depending on the choice of p and q, UE-based LDP protocols can be classified as follows [6]:

  1. Symmetric UE (SUE): p and q must satisfy p + q = 1, which under a privacy budget ε yields p = e^(ε/2) / (e^(ε/2) + 1) and q = 1 / (e^(ε/2) + 1).
  2. Optimized UE (OUE): setting p and q can be viewed as splitting ε into ε1 + ε2 such that p = e^(ε1) / (e^(ε1) + 1) and q = 1 / (e^(ε2) + 1); choosing the split that minimizes the estimation variance gives p = 1/2 and q = 1 / (e^ε + 1). (The snippet after this list computes both settings.)
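
Both parameter settings can be computed directly from ε; a small sketch following the formulas from [6] (the printed values are just for inspection):

    # Keep/flip probabilities for SUE and OUE as functions of epsilon [6].
    import math

    def sue_params(eps: float):
        p = math.exp(eps / 2) / (math.exp(eps / 2) + 1)
        return p, 1 - p  # symmetric: p + q = 1

    def oue_params(eps: float):
        return 0.5, 1 / (math.exp(eps) + 1)  # optimized: p is fixed at 1/2

    for eps in (0.5, 1.0, 4.0):
        print(eps, sue_params(eps), oue_params(eps))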

However, both SUE and OUE depend on the domain size d, which may not scale well when d is large. To remove this dependence on d, we propose a new LDP protocol called Optimized Multiple Encoding (OME) [7]. The key idea is to map each real value v_i of the embedding vector into a binary vector of a fixed size l; privacy protection is then accomplished by randomly flipping the bits of these binary vectors, where the flipping probabilities are governed by two tunable hyperparameters, λ and ε (see [7] for the exact perturbation probabilities).
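
The overall shape of the randomization can be sketched as follows. Note that this is only an illustration: the binarization here is simple fixed-point quantization, and keep_one / flip_zero are placeholder probabilities, not the λ- and ε-dependent formulas derived in [7].

    # Illustrative OME-style randomization: binarize each real value of the
    # embedding to l bits, then flip the bits independently.
    import numpy as np

    def binarize(v: np.ndarray, l: int = 8) -> np.ndarray:
        """Quantize values in [0, 1] to l-bit codes (one row of bits per value).
        This is one possible binarization; [7] specifies its own mapping."""
        levels = np.clip((v * (2 ** l - 1)).astype(int), 0, 2 ** l - 1)
        return ((levels[:, None] >> np.arange(l - 1, -1, -1)) & 1).astype(np.int8)

    def perturb(bits: np.ndarray, keep_one: float, flip_zero: float) -> np.ndarray:
        """A 1-bit stays 1 with prob. keep_one; a 0-bit becomes 1 with prob. flip_zero.
        In OME these probabilities are functions of lambda and epsilon [7]."""
        rnd = np.random.rand(*bits.shape)
        return np.where(bits == 1, rnd < keep_one, rnd < flip_zero).astype(np.int8)

    emb = np.random.rand(4)  # toy 4-dimensional embedding with values in [0, 1]
    noisy_bits = perturb(binarize(emb, l=8), keep_one=0.8, flip_zero=0.1)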

Our proposed framework

As shown in Fig. 3, the general setting of our proposed deep learning with LDP consists of three main modules: (1) an embedding module that outputs a 1-D real-valued representation of length r; (2) a randomization module that produces a locally differentially private binary representation; and (3) a classifier module that trains on the randomized binary representations to produce a differentially private classifier.

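A minimal PyTorch-style sketch of how the three modules compose (the module sizes, the sigmoid squashing, and the keep_one / flip_zero probabilities are illustrative assumptions; the released code linked below is the reference implementation):

    # Sketch of the three-module pipeline: embed -> randomize -> classify.
    import torch
    import torch.nn as nn

    class LDPNN(nn.Module):
        def __init__(self, vocab_size=10000, r=64, l=8, num_classes=2):
            super().__init__()
            self.l = l
            self.embed = nn.EmbeddingBag(vocab_size, r)       # (1) length-r representation
            self.classifier = nn.Sequential(                  # (3) trained on noisy bits
                nn.Linear(r * l, 128), nn.ReLU(), nn.Linear(128, num_classes))

        def randomize(self, z, keep_one=0.8, flip_zero=0.1):  # (2) runs on the data owner's side
            # Squash to (0, 1), quantize each value to l bits, then flip bits.
            levels = torch.clamp((torch.sigmoid(z) * (2 ** self.l - 1)).long(),
                                 0, 2 ** self.l - 1)
            shifts = torch.arange(self.l - 1, -1, -1)
            bits = ((levels.unsqueeze(-1) >> shifts) & 1).float()
            rnd = torch.rand_like(bits)
            noisy = torch.where(bits == 1, rnd < keep_one, rnd < flip_zero)
            return noisy.float().flatten(1)

        def forward(self, tokens):
            # In deployment, steps (1)-(2) run locally and only the noisy bits
            # leave the device; the server trains (3) on those bits.
            return self.classifier(self.randomize(self.embed(tokens)))

    model = LDPNN()
    logits = model(torch.randint(0, 10000, (32, 20)))  # 32 sequences of 20 token ids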

Fig. 3: General setting for deep learning with LDP.


Performance evaluation

To examine the performance of our proposed locally differentially private NN (LDPNN), we first compare it with a non-private NN (NPNN), in which the randomization module is removed. We evaluate the two models on three NLP tasks: 1) sentiment analysis (IMDb, Amazon, and Yelp datasets), 2) intent detection (Intent dataset), and 3) paraphrase identification (MRPC dataset). Table 1 shows that our LDPNN delivers comparable or even better results than the NPNN across a wide range of privacy budgets ε when the randomization factor λ ≥ 50. We hypothesize that the LDP randomization acts as a regularizer that helps avoid overfitting.


Secondly, we also compare OME with the other two LDP protocols, i.e., SUE and OUE, on the sentiment analysis tasks. Table 2 suggests that our OME significantly outperforms both.


Conclusion


To conclude, we formulated a new deep learning framework that allows data owners to send differentially private representations to untrusted servers for further learning. A novel LDP protocol was proposed to adjust the randomization probabilities of the binary representation while maintaining both high privacy and high accuracy under a wide range of privacy budgets. Experimental results on various NLP tasks confirm the effectiveness and superiority of our framework.


Where to find the paper and code?

Paper: http://arxiv.org/abs/2006.14170

Code: https://github.com/lingjuanlv/Differentially-Private-Text-Representations

[1] Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of CCS. ACM, 308–318.

[2] Matthew Peters et al. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, Volume 1 (Long Papers).

[3] Jacob Devlin et al. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, Volume 1 (Long Papers).

[4] Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018. Towards robust and privacy-preserving text representations. In Proceedings of ACL. 25–30.

[5] Maximin Coavoux, Shashi Narayan, and Shay B. Cohen. 2018. Privacy-preserving neural representations of text. In Proceedings of EMNLP. 1–10.

[6] Tianhao Wang, Jeremiah Blocki, Ninghui Li, and Somesh Jha. 2017. Locally differentially private protocols for frequency estimation. In USENIX Security. 729–745.

[7] L. Lyu, Y. Li, X. He, and T. Xiao. 2020. Towards differentially private text representations. In Proceedings of SIGIR.

Translated from: https://medium.com/swlh/how-to-preserve-privacy-of-text-representations-in-nlp-3ed244148d75
