Written by terrovitis
Set-valued data are a natural representation of information in a large number of application areas, ranging from retail sales logs to medical records. Set-valued datasets are sparse multidimensional data that usually contain unique combinations of values. Publishing such datasets poses great dangers to the privacy of the individuals that are associated with the records. Adversaries with partial background knowledge can use that knowledge to retrieve the whole record that is associated with an individual.
The aim of this article is to provide an overview of the basic anonymization methods for set-valued data. Anonymization methods sanitize set-valued datasets in such a way that the privacy of the individuals who are associated with the data is guaranteed. The article describes the basic privacy guarantees that are used in the research literature, and briefly presents the state-of-the-art anonymization methods.
Privacy preserving publishing of set-values
As technology infiltrates more and more aspects of our lives, each human activity leaves a digital trace in some repository. Vast amounts of personal data are implicitly or explicitly created each day, and one is rarely aware of the extent of the information that is kept and processed about him or her. These personal data give rise to significant concerns about user privacy, since important and sensitive details about one's life are collected and exploited by third parties. The goal of privacy preservation technologies is to provide tools that allow greater control over the dissemination of user data. A promising trend in the field is Privacy Preserving Data Publishing (PPDP), which allows sharing of anonymized data. Anonymizing a dataset is not limited to the removal of direct identifiers that might exist in it, e.g. the name or the Social Security Number of a person. It also includes removing secondary information, e.g. age and zipcode, that might lead indirectly to the true identity of an individual. This secondary information is often referred to as quasi-identifiers.
A basic attack scenario against user privacy is as follows: an adversary who knows some characteristics of a person, e.g. age and zipcode, searches the anonymized data for records that match her background knowledge. If a single record is found, then she can be certain that the record refers to the person she knows. The sparser the data are, the more unique combinations exist, and the easier it is for an adversary to locate unique records that correspond to specific users. This makes collections of set-values, which are sparse multidimensional data, very hard to anonymize effectively without compromising user privacy [Aggarwal 2007]. There are several methods in the research literature that address the challenges of anonymizing set-valued data. I report the basic techniques classified in three categories: anonymization methods that protect against identity disclosure, methods that protect against attribute disclosure, and methods that offer differential privacy.
Protection against identity disclosure
Protection against identity disclosure guarantees that adversaries will not be able to associate specific records with a known individual. The most popular guarantee is k-anonymity [Samarati 2001, Sweeney 2002]. k-anonymity guarantees that each record will be indistinguishable from at least k-1 other records with respect to the quasi identifiers. Every combination of quasi-identifier values either does not appear in the anonymized dataset or appears at least k times.
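As a minimal illustration of the guarantee itself (not of any particular anonymization algorithm), the Python sketch below checks whether a table of quasi-identifier combinations is k-anonymous; the function name and the toy records are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, k):
    """Return True if every quasi-identifier combination that appears in
    `records` appears at least k times. Each record is a tuple of
    quasi-identifier values, e.g. (age, zipcode)."""
    counts = Counter(records)
    return all(c >= k for c in counts.values())

# Toy example: (age, zipcode) pairs
records = [(30, "10001"), (30, "10001"), (30, "10001"),
           (45, "10002"), (45, "10002"), (45, "10002")]
print(is_k_anonymous(records, k=3))  # True
print(is_k_anonymous(records, k=4))  # False
```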
In [He et. al. 2009] a method for transforming set-valued data to a k-anonymous form, the Partition algorithm, is proposed. Partition employs generalization for transforming the data to a k-anonymous form. Generalization is the replacement of a group of original values by a new, more abstract one. For example, if the residence area of an individual is reported in terms of cities, the city name can be replaced by the country name in the anonymized data, as in Figure 2. Partition employs local recoding; not all appearances of an original value are replaced by a generalized one. Partition is a top-down algorithm; it starts by considering that all values are generalized to the most generic value of the generalization hierarchy and then drills down the hierarchy until the k-anonymity property no longer holds.
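The generalization step that Partition builds on can be sketched in a few lines of Python; the two-level city-to-country hierarchy and the place names below are illustrative assumptions, not data from [He et. al. 2009].

```python
# Hypothetical two-level generalization hierarchy: city -> country.
hierarchy = {"Athens": "Greece", "Patras": "Greece", "Rome": "Italy"}

def generalize(record, values_to_generalize):
    """Replace the selected values of one record by their parent in the
    hierarchy. Because the selection is made per record, different
    occurrences of the same value may or may not be generalized (local
    recoding); replacing a value everywhere would be global recoding."""
    return {hierarchy.get(v, v) if v in values_to_generalize else v
            for v in record}

r1, r2 = {"Athens", "Rome"}, {"Patras", "Rome"}
# Generalizing only the Greek cities makes the two records identical.
print(sorted(generalize(r1, {"Athens"})), sorted(generalize(r2, {"Patras"})))
# ['Greece', 'Rome'] ['Greece', 'Rome']
```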
Despite using a flexible local recoding transformation, Partition has to significantly transform the original data in order to achieve k-anonymity. Records of different cardinalities that contain items from a large domain must end up in groups of identical records of size at least k. In response to this problem, [Terrovitis et. al. 2008, Terrovitis et. al. 2010] propose km-anonymity. km-anonymity requires that each combination of up to m quasi identifiers must appear at least k times in the published data. The intuition behind km-anonymity is that there is little privacy gain from protecting against adversaries who already know most of the terms of one record, and significant information loss in the effort to do so. In [Terrovitis et. al. 2008, Terrovitis et. al. 2010] algorithms that rely on generalization using both global and local recoding are proposed.
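The definition of km-anonymity can likewise be illustrated with a small Python check; it verifies the guarantee on a toy collection of item sets and is not one of the algorithms of [Terrovitis et. al. 2008, Terrovitis et. al. 2010].

```python
from collections import Counter
from itertools import combinations

def is_km_anonymous(records, k, m):
    """Return True if every combination of up to m items that occurs in
    some record is contained in at least k records. Each record is a set
    of items."""
    support = Counter()
    for rec in records:
        for size in range(1, m + 1):
            for combo in combinations(sorted(rec), size):
                support[combo] += 1
    return all(c >= k for c in support.values())

records = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
print(is_km_anonymous(records, k=2, m=1))  # True: every single item appears at least twice
print(is_km_anonymous(records, k=2, m=2))  # False: {"b", "c"} appears only once
```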
In [Loukides et. al. 2011, Gkoulalas et. al. 2011] the authors provide an even more targeted approach. These works assume that a set of privacy and utility constraints is known by the data publisher. Such an assumption holds in specific domains, like that of medical records [Loukides et. al. 2010/II]. Exploiting this explicit knowledge about privacy dangers and utility, the authors propose algorithms that achieve protection against identity disclosure with limited information loss.
Protection against attribute disclosure
The data values in a dataset are not usually equally important as personal information. A common distinction in the privacy-related literature is between quasi identifiers and sensitive values. Quasi identifiers are usually known through several sources and they do not threaten the privacy of an individual. Sensitive values, on the other hand, are not considered available through other sources and they reveal important personal information. If such a distinction holds and it is known by the data publisher, then the data must also be protected against the disclosure of sensitive attributes. A common guarantee for protecting sensitive values is l-diversity [Machanavajjhala et. al. 2006]. l-diversity guarantees that any adversary cannot associate her background knowledge with fewer than l well-represented sensitive values. Well-represented is usually defined through a probability threshold: an adversary cannot associate her background knowledge with any sensitive value with probability over 1/l.
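The probability-threshold reading of l-diversity translates into an equally small check; the grouping into anonymized groups and the disease names below are assumed for illustration only.

```python
from collections import Counter

def is_l_diverse(groups, l):
    """Return True if, within every anonymized group, no sensitive value
    has relative frequency above 1/l. `groups` is a list of lists of
    sensitive values."""
    for sensitive_values in groups:
        counts = Counter(sensitive_values)
        if max(counts.values()) / len(sensitive_values) > 1.0 / l:
            return False
    return True

groups = [["flu", "flu", "asthma", "diabetes"],        # max frequency 0.50
          ["flu", "asthma", "diabetes", "hepatitis"]]  # max frequency 0.25
print(is_l_diverse(groups, l=2))  # True
print(is_l_diverse(groups, l=3))  # False: 0.50 > 1/3 in the first group
```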
The first anonymization method that provided protection against attribute disclosure for set-valued data was proposed in [Ghinita et. al. 2008]. The proposal of [Ghinita et. al. 2008, Ghinita et. al. 2011] relies on separating the sensitive values from the quasi identifiers, as depicted in Tables 4 and 5. The idea of separation was first proposed in the relational context in [Xiao et. al 2006], but it was adjusted and extended in [Ghinita et. al. 2008, Ghinita et. al. 2011] for the set-valued context. The basic idea of the proposed anonymization method is to create clusters of similar records (with respect to the quasi identifiers) and then publish, within each cluster, the quasi identifiers and the sensitive values separately. A benefit of this transformation over generalization and suppression is that it does not require creating groups with identical quasi identifiers. This way the information loss is kept low, even for data of very high cardinality and dimensionality.
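A minimal sketch of the separation idea follows; it only shows how the record-level link between quasi identifiers and sensitive values is broken inside a cluster, not the clustering strategy of [Ghinita et. al. 2008], and the item and drug names are invented.

```python
from collections import Counter

def publish_separated(clusters):
    """For each cluster of records (record = (quasi_identifier_set, sensitive_value)),
    publish the quasi-identifier sets as they are and the sensitive values only
    as a multiset, so no sensitive value is tied to a specific record."""
    published = []
    for cluster_id, records in enumerate(clusters):
        qi_part = [sorted(qi) for qi, _ in records]
        sensitive_part = Counter(s for _, s in records)
        published.append({"cluster": cluster_id,
                          "quasi_identifiers": qi_part,
                          "sensitive_summary": dict(sensitive_part)})
    return published

clusters = [[({"bread", "milk"}, "drug A"), ({"bread", "beer"}, "drug B")],
            [({"wine"}, "drug B"), ({"wine", "milk"}, "drug C")]]
for bucket in publish_separated(clusters):
    print(bucket)
```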
Protection against attribute disclosure is also studied in [Cao et. al. 2010]. The proposed guarantee, p-uncertainty, is similar to l-diversity and requires that quasi identifiers cannot be associated with sensitive values with probability over 1/p. The novelty of the approach is that it considers as a quasi identifier every subset of a record that appears in the dataset, including subsets that contain sensitive values. This means that sensitive values can also act as quasi identifiers. The proposed anonymization method relies both on generalization and suppression.
A guarantee that provides protection both from identity and attribute disclosure, (h,k,p)-coherence, is proposed in [Xu et. al. 2008/I, Xu et. al. 2008/II]. Similarly to km-anonymity, it protects from adversaries who know up to p terms, by guaranteeing that every such combination of items appears at least k times. Moreover, (h,k,p)-coherence guarantees that combinations of up to p items cannot be associated with a sensitive value with probability over h. The proposed anonymization method relies solely on suppression.
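The guarantee essentially combines the two previous checks, as the following sketch shows; the helper is a naive verifier written for this article, not the suppression algorithm of [Xu et. al. 2008/I, Xu et. al. 2008/II].

```python
from collections import Counter
from itertools import combinations

def is_hkp_coherent(records, sensitive, h, k, p):
    """Each record is a set of items; `sensitive` is the set of sensitive
    items. Every combination of up to p non-sensitive items that occurs in
    the data must be supported by at least k records, and among those
    records no sensitive item may appear with relative frequency above h."""
    combo_support = Counter()
    sensitive_support = Counter()
    for rec in records:
        public = sorted(rec - sensitive)
        for size in range(1, p + 1):
            for combo in combinations(public, size):
                combo_support[combo] += 1
                for s in rec & sensitive:
                    sensitive_support[(combo, s)] += 1
    for combo, support in combo_support.items():
        if support < k:
            return False
        if any(sensitive_support[(combo, s)] / support > h for s in sensitive):
            return False
    return True

records = [{"a", "b", "x"}, {"a", "b"}, {"a", "b"}, {"a", "b"}]
print(is_hkp_coherent(records, sensitive={"x"}, h=0.5, k=2, p=2))  # True
print(is_hkp_coherent(records, sensitive={"x"}, h=0.2, k=2, p=2))  # False: 0.25 > 0.2
```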
In [Loukides et. al. 2010/I] an anonymization method that can be tailored to specific sensitive inferences is presented. The authors propose the notion of PS-rules, which are sensitive association rules specified by the data owner. The anonymization procedure guarantees that an adversary will not be able to infer these rules with high certainty. The proposed anonymization algorithms are based on generalization.
Differential privacy for set-valued data
Differential privacy [Dwork et. al. 2006] is a more generic guarantee than k-anonymity and l-diversity, and the protection it offers is broader than protection against identity and attribute disclosure. It does not make any assumptions about the adversary's background knowledge (although anonymization methods that provide differential privacy might make some implicit ones [Kifer et. al. 2011]), and it requires that any analysis of the anonymized data is not significantly affected by the addition or removal of one record in the original data. The trade-off for providing such a strong anonymity guarantee is that the information quality of the result is significantly affected. Only the results of specific analyses can be provided, by adding noise through a non-deterministic procedure.
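The core building block of most such methods is the Laplace mechanism for numeric queries. The sketch below, in plain Python with illustrative parameter names, perturbs a single count; it is meant only to convey the idea of calibrating noise to the sensitivity of the query.

```python
import random

def laplace_noise(scale):
    """A Laplace(0, scale) sample, obtained as the difference of two
    exponential samples (illustrative, not a hardened implementation)."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, sensitivity=1.0):
    """An epsilon-differentially private count: adding or removing one
    record changes the count by at most `sensitivity`, so Laplace noise
    with scale sensitivity / epsilon masks any single contribution."""
    return true_count + laplace_noise(sensitivity / epsilon)

print(noisy_count(1000, epsilon=0.5))  # e.g. 1002.7; the noise differs per run
```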
A first approach to the anonymization of set-values by using differential privacy appears in [Korolova et. al. 2009], in the context of web search query logs. Each record is the web search history of a single user, which can be modeled as a set-value, but the result of the anonymization is not set-values. Only the frequent query terms and their supports (with the addition of Laplace noise) are published. A more recent approach, which publishes anonymized set-values under differential privacy and provides set-values as a result, appears in [Chen 2011]. The paper proposes DiffPart, a top-down algorithm that decides which frequent itemsets can be published without violating differential privacy. The supports of all published items are distorted by adding Laplace noise.
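The basic pattern behind both works (counting item supports, perturbing them and keeping only those above a threshold) can be sketched as follows; this is a simplified illustration that assumes a per-item sensitivity of 1 and ignores the budget splitting of the actual algorithms, so it is neither DiffPart nor the query-log method of [Korolova et. al. 2009].

```python
import random
from collections import Counter

def laplace_noise(scale):
    # A Laplace(0, scale) sample as the difference of two exponential samples.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def publish_noisy_frequent_items(records, epsilon, threshold):
    """Count item supports over set-valued records, perturb each count with
    Laplace noise, and release only items whose noisy support clears
    `threshold`. Sensitivity is simplified to 1, i.e. one user is assumed
    to affect each per-item count by at most one."""
    support = Counter()
    for rec in records:
        for item in set(rec):
            support[item] += 1
    noisy = {item: count + laplace_noise(1.0 / epsilon)
             for item, count in support.items()}
    return {item: c for item, c in noisy.items() if c > threshold}

histories = [{"flights", "hotels"}, {"flights"}, {"flights", "news"}]
print(publish_noisy_frequent_items(histories, epsilon=1.0, threshold=1.5))
```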