Privacy preserving microdata publication has received wide attention [1-5]. Each record corresponding to one individual has a number of attributes, which can be divided into the following three categories: 1) identity attributes (e.g., Name and SSN)whose values can uniquely identify an individual; 2) quasiidentifier(QI) attributes (e.g., demographic attributes such as ZIP code, age, and gender) whose values when taken together can potentially identify an individual; 3) sensitive attributes(e.g., disease and income) that indicate confidential information of individuals.
Before microdata are released, identity attributes are often directly removed to preserve privacy of individuals whose data are in the released table. However, the QI information may be used by attackers to link to other public data sets to get the private information of individuals. This is recognized as linking attacks in microdata publishing. Two types of information disclosures have been identified under linking attacks: identity disclosure and attribute disclosure [6]. Identity disclosure occurs if attackers can identify an individual from the released data.Attribute disclosure occurs when confidential information about an individual is revealed and can be attributed to the individual.Samarati and Sweeney [1] proposed the k-anonymity model and presented a generalization approach that divides tuples into QI-group by transforming their QI-values into less specific forms,such that tuples in the same QI-group cannot be uniquely identified by attackers to counter linking attacks based on quasi-identifiers.It was identified that k-anonymity is vulnerable to homogeneity and background knowledge attacks when data in the QI-group lacks diversity in the sensitive attributes [2]. ldiversity[2], as well as following models (e.g., t-closeness [7]),were proposed to protect attribute disclosure. l-diversity requires that the sensitive attribute has at least one well-represented values for each QI-group in the generalized dataset.
The randomization approach has also been adopted to publish microdata [8-12]. Instead of generalizing QI attribute values,randomization approach distorts the original value to another domain value according to some distortion probabilities.The application of the randomization technique was studied to prevent identity disclosure under linking attacks in data publishing [12]. They focused on evaluating the risk of successfully linking a target individual to the index of his record given values of QI attributes. Our research moves one step further to investigate attribute disclosure under linking attacks. We focus on evaluating the risk of successfully predicting the sensitive attribute value of a target individual given his QI attribute values.We present a general randomization framework and give efficient solutions to determine optimal randomization parameters for both QI and sensitive attributes. Thus, we can maximize data utility, while satisfying privacy preservation requirements for sensitive attributes. We compare our randomization approach with other anonymization approaches within the framework(e.g., two representative approaches l-diversity [2] and anatomy[5] are used in this paper). Our result shows the randomization approach can better recover the distribution of the original data set from the disguised one. Thus, intuitively, it might yield a disguised database with higher data utility than l-diversity generalization and anatomy.
Our contributions are summarized as follows. We present a systematic study of the randomization method in preventing attribute disclosure under linking attacks. We propose the use of a specific randomization model and present an efficient solution to derive distortion parameters to satisfy requirements for privacy preservation, while maximizing data utilities. We propose a general framework and present a uniform definition for attribute disclosure which is compatible for both randomization and generalization models. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation(under the same privacy requirements) from three aspects(reconstructed distributions, accuracy of answering queries, and preservation of correlations). Our empirical results show that randomization incurs significantly smaller utility loss.
The remainder of this paper is organized as follows. In Section II, we discuss closely related work on group based anonymization approaches and randomization approaches in privacy preservation data publishing. In Section III, we present background on randomization based distortions, including analysis and attacks on the randomized data. In Section IV, we first quantify attribute disclosure under linking attacks and then show our theoretical results on maximizing utility with privacy constraints. In Section V, we conduct empirical evaluations and compare the randomization based distortion with two representative group based anonymization approaches (l-diversity [2]and anatomy [5]). We conclude our work in Section VI.
k-anonymity was proposed [1] to counter linking attacks for data publishing. The method generalizes the values of quasiidentifier attributes to less-specific ones, so that each individual cannot be distinguished from at least k - 1 other individuals based on quasi-identifier information. There has been much study in designing efficient algorithms for k-anonymity using generalization and suppression techniques [2, 3, 5, 13, 14].
k-anonymity provides protection against identity disclosure,while it contributes little to attribute disclosure. l-diversity [2] is proposed to counter attribute disclosure. A table is l-diverse if,in each QI-group, at most 1/l of the tuples possesses the most frequent sensitive value. Thus, an individual can be linked to his sensitive value correctly with probability at most 1/l. Similarly,the notation of t-closeness [7] requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold t). The anatomy method was proposed to protect attribute disclosure by breaking the link between quasi-identifiers and sensitive attributes [5]. Anatomy releases all the quasi-identifier and sensitive values directly in two separate tables: a quasiidentifier table (QIT) and a sensitive table (ST). They presented an algorithm to compute anatomized tables that satisfy the l-diversity requirement, while minimizing the error of reconstructing the data.
Kifer and Gehrke [15] investigated the problem of injecting additional information into k-anonymous and l-diverse tables to improve the utility of published data. Koudas et al. [16] studied the problem of preserving privacy through sensitive attribute permutation and generalization with the addition of fake sensitive attribute values. Narayanan and Shmatikov [17] presented a de-anonymization methodology for sparse multi-dimensional microdata, such as individual transactions, preferences, and so on. The de-anonymization algorithm could be used by attackers against data sets containing anonymous high-dimensional data.Brickell and Shmatikov [18] compared the privacy loss (defined by certain kinds of information learned by attackers) with the utility gain (defined as the same kinds of information learned by analysts) caused by data anonymization and concluded that“even modest privacy gains require almost complete destruction of the data mining utility” in generalization methods. Li and Li[19] proposed the use of the worst-case privacy loss rather than the average privacy loss among all individuals adopted in [18]to measure privacy loss. They further presented a methodology to evaluate the privacy-utility tradeoff, where privacy loss is quantified by the adversary’s knowledge gain about the sensitive values of specific individuals with the trivially-anonymized data as the baseline and utility loss is measured by the information loss about the sensitive values of large populations with the original data as the baseline. In our paper, we use 1/l to bound the attribute disclosure for all records and use the difference of distributions between the original data and the reconstructed data to quantify utility loss. We also examine the utility loss from the perspective of answering aggregate queries and preserving correlations between the sensitive attribute and QI attributes.
In this paper, we focus on the randomization model in resisting attribute disclosure of data publishing. The proposed randomization model is compared to traditional l-diversity [2] and anatomy [5] models. Recently, Ghinita et al. [14] proposed a framework to solve the privacy-constrained and accuracy constrained data anonymization problems under the k-anonymity and l-diversity models. They developed efficient heuristics to solve the one-dimensional problems in linear time and generalized solutions to multi-dimensional attributes using space-mapping techniques. Their empirical evaluation results showed that privacy/accuracy-constrained methods outperform existing generalization approaches in terms of the execution time and information loss. We would also point out that researchers have developed various anonymity models, including p-sensitive k-anonymity[20], (α, k)-anonymity [21], (k, e)-anonymity [4], (c,k)-safety [22], m-confidentiality [23], privacy skyline [24], δ-presence [25], (ε, m)-anonymity [26], and (t, λ)-uniqueness[27]. Detailed comparisons to these works are beyond the scope of this paper.
The first randomization model, called Randomized Response(RR), was proposed by Warner [28] in 1965. The model deals with one dichotomous attribute, i.e., every person in the population belongs either to a sensitive group A, or to its complement ？. The problem is to estimate π_{A}, the unknown proportion of population members in group A. Each respondent is provided with a randomization device by which the respondent chooses one of the following two questions Do you belong to A or Do you belong to ？ with respective probabilities p and 1 - p and then replies yes or no to the question chosen. The technique provides response confidentiality and increases respondents’ willingness to answer sensitive questions, because no one but the respondent knows to which question the answer pertains. It has been extended to the field of privacy preserving data mining[29].
Aggarwal and Yu [30] provided a survey of randomization models for privacy preservation data mining, including various reconstruction methods for randomization, adversarial attacks,optimality and utility of randomization. Rizvi and Haritsa [9]developed the MASK technique to preserve privacy for frequent item set mining. Agrawal and Haritsa [31] presented a general framework of random perturbation in privacy preservation data mining. Du and Zhan [8] studied the use of a randomized response technique to build decision tree classifiers. Guo et al. [10] investigated data utility in terms of the accuracy of reconstructed measures in privacy preservation market basket data analysis. Aggarwal [32] proposed adding noise from a specified distribution to produce a probabilistic model of k-anonymity.Zhu and Liu [33] investigated the construction of optimal randomization schemes for privacy preservation density estimation and proposed a general framework for randomization using mixture models. Rebollo-Monedero et al. [34]defined the privacy measure similar to t-closeness [7] and formalized the privacy-utility tradeoff from the information theory perspective. The resulting solution turns out to be the postrandomization(PRAM) method [35] in the discrete case and a form of random noise addition in the general case. They also proved that the optimal perturbation is in general randomized,rather than deterministic in the continuous case.
Huang and Du [36] studied the search of optimal distortion parameters to balance privacy and utility. They developed an evolutionary multi-objective optimization method to find optimal distortion matrices when applying the RR technique on a single attribute. The complexity of the algorithm is O(((N_{Q} + N_{V})^{3} + n^{2}) L) where N_{Q} is the population size, NV is the archive size, n is the number of columns in the RR matrix, and L is the maximum number of iterations. However, there is no guarantee that the derived matrices are optimal in the entire search space and the performance tends to be sacrificed for multiple attributes due to the intractable size of the search space. Similarly,Xiao et al. [37] investigated the optimal random perturbation at multiple privacy levels.
In our paper, we focus on a specific randomization model(i.e., perturbation retains the original value of an attribute A_{i} with probability p_{i} and replaces the original value with a random value from the domain of A_{i}, see Equation 1) and derive an efficient solution to determine the optimal randomization parameters under linking attacks. Chaytor and Wang [38] also applied the use of this simple randomization model to randomize a sensitive value only within a subset of the entire domain.The developed algorithm, called Small Domain Randomization,can effectively derive optimal partitions, such that randomizing the sensitive attribute values of records (with small domains) in each partition separately can retain more data utility with the same level of privacy than randomizing the whole data set over the entire domain. Our work focuses on attribute disclosure when applying full domain randomization on both QI attributes and sensitive attribute S. The derived optimal randomization parameter values under privacy constraints can be incorporated in the small domain randomization.
Dataset T contains N records and m + 1 categorical attributes: A_{1}, A_{2}, . . . , A_{m}, and S. We use QI = {A_{1}, . . . , A_{m}} to denote the set of quasi-identifier attributes (e.g., demographic) whose values may be known to the attacker for a given individual and use S to denote one sensitive attribute whose value should not be associated with an individual by attackers. Generally, T may also contain other attributes that are neither sensitive nor quasiidentifying.Those attributes are usually kept unchanged in the released data. We exclude them from our setting, since they do not incur privacy disclosure risk or utility loss. All of the discussions in this paper are also explained in the single sensitive attribute setting and can be generalized to multiple sensitive attributes.
Attribute A_{i} has d_{i} categories denoted by 0,1,. . . , d_{i} - 1. We use Ω_{i} to denote the domain of A_{i} (S), Ω_{i} = {0, 1, . . . , d_{i} - 1},and Ω_{QI} = Ω_{1} × · · · × Ω_{m} is the domain of quasi-identifiers. Similarly,attribute S has d_{s} categories denoted by 0,1,. . . , d_{s} - 1. We use Ω_{s} = {0, 1, . . . , d_{s} - 1} to denote the domain of S. The r-th record R_{r} is denoted by (A_{1r}, A_{2r}, . . . , A_{mr}, S_{r}) or simply (Q_{Ir}, S_{r}).Let D = d_{s} Π^{m}_{i-1}d_{i} be the total number of cells in the contingency table.
Let π_{i1, … , i m, is}denote the true proportion corresponding to the categorical combination (A_{1} = i_{1}, · · · , A_{m} = i_{m}, S = i_{s}). Let π_{i1}be the column vector with D elements π_{il}, · · ·, _{im}, _{is} arranged in a fixed order. Table 1 shows one contingency table example for a data set with one QI attribute (Gender, d_{1} = 2) and one sensitive attribute S (Disease, d_{s} = 3). Table 1a shows the original contingency table where π = (π_{00}, π_{01}, π_{02}, π_{10}, π_{11}, π_{12})' corresponds to a fixed order of cell entries π_{ij} in the 2 × 3 contingency table. π_{10} denotes the proportion of records with Female and Cancer. The column sum π_{+0} represents the proportion of records with Cancer across both genders. Note that contingency tables are widely used in statistics to record and analyze the relationship between
categorical attributes. Results of most data mining tasks (e.g.,clustering, decision tree learning, and association rule mining),as well as aggregate queries, are solely determined by the contingency table. That is, analysis on contingency tables is equivalent to analysis on the original categorical data. Table 1b shows one contingency table instance derived from the data set with 100 tuples. We will use this contingency table as an example to illustrate properties of link disclosure.
We use the left part of Fig. 1 to illustrate the process of privacy preserving data publishing. For each record R_{j}, the data owner independently randomizes attribute A_{i} using the distortion matrix P_{i}. Specifically, for attribute A_{i} (or S) with d_{i} categories,the randomization process is to change a record belonging to the v-th category (v = 0, . . . , d_{i} - 1) to the u-th category with probability p_{uv}^{(i)}: Pr(A_{i} = u'A_{i} = v) = p_{uv}^{(i)}. Let p_{i} = [p_{uv}^{(i)}]_{di×di}, and we call P_{i} (or P_{s}) the distortion matrix for A_{i} (or S). Naturally,the column sums of P_{i} are equal to 1. The original database T is changed to T, and then both the randomized data set and the distortion matrices are published. The randomization matrices indicate the magnitude of the randomization; this can help data analysts estimate the original data distribution.
Let λ denote the contingency table of the randomized data T. We arrange λ into the column vector with the same order of π.. Table 1c shows one example of the randomized contingency table. The randomized contingency table has a close relationship to the original contingency table and the randomization matrices: E(λ) = P π where P = P_{1} ？ · · · ？ P_{m} ？ P_{s}, and ？stands for the Kronecker product. The Kronecker product is an operation on two matrices, an m-by-n matrix A and a p-by-q matrix B, resulting in the mp-by-nq block matrix.
Distortion matrices determine the privacy and utility of the randomized data. How to find optimal distortion parameters with privacy or utility constraints has remained a challenging problem [39]. In this paper, we limit the choice of randomization parameters for each QI attribute A_{i} (and sensitive attribute S) as:
That is, for each attribute A_{i}, all categories have the same probability p_{i} to remain unchanged, and have the same probability q_{i} to be distorted to a different category. With this choice limit, we can derive an efficient algorithm (with explicit formula)to determine the optimal randomization parameters (as shown in Section IV-B).
One advantage of randomization procedure is that the pure random process allows data analysts to estimate the original data distribution based on the released data and the randomization parameters. The right part of Fig. 1 shows how data analysts estimate the original data distribution. With the randomized data T and its contingency table λ, the unbiased estimate of π is given by π = P^{-1}λ and the covariance matrix of the estimator is
where (Pπ)^{δ} stands for the diagonal matrix, whose diagonal values are Pπ.
Thus, data analysts derive the estimation for the distribution of original data (in terms of contingency table) without disclosing the individual information of each record. We choose the accuracy of reconstructed distribution as the target utility function in Section IV-B, because most data mining applications are based on the probability distribution of the data.
Let X be an attribute or a subset of attributes in the data set T with domain Ω_{X}, and X is the randomized value of X in T . It is not reasonable for attackers to regard the observed value as the true value of X, with the randomization process and parameters.Instead, attackers can try to estimate the original value based on the observed data and the released randomization parameters.Let X denote the attackers’ estimation on the original value of X. Any value in Ω_{X} is possible due to the randomization procedure.We assume that the attacker is able to calculate the posterior probability of content in the data set and takes the following probabilistic strategy: for any μ, ν ∈ Ω_{X},
where Pr(X = μ' X= ν) denotes the attacker’s posterior belief on the original value X = μ when he observes X = ν. With the Bayes’ theorem, it can be calculated by:
The following lemma gives the accuracy of the attacker’s estimation.
LEMMA 1. Suppose attackers adopt the probabilistic strategy specified in (3) to estimate the data. The probability that attackers accurately estimate the original value of X is given by:
PROOF. For a particular observed value X = ν ∈ Ω_{X},Pr(X = X = μ, X = ν) = Pr( = ν'X = μ) Pr(X = μ' = ν). Then, with the law of total probability, we have
The probability of attackers’ correct estimation is also defined as the reconstruction probability, Pr(X = μ → X = μ), in [9, 12].It captures the round-trip of going from the original content to the distorted one and then returning to estimate the value of the original content. That is, it indicates how much information of the original data is preserved after randomization. To make the notation concise, we adopt the notation Pr(X= X = μ) to denote the reconstruction probability to evaluate the risk of the sensitive attribute disclosure.
We measure privacy in terms of the attribute disclosure risk,whose formal definition is given, as follows:
DEFINITION 1. The attribute disclosure risk under linking attacks is defined to be the probability that the attacker predicts Sr successfully given QI_{r} of a target individual r, denoted as Pr(S_{r}'QI_{r}).
We need to quantify the background knowledge of attackers to derive the disclosure probability Pr(S_{r}'QI_{r}). We have the following standard assumptions for background knowledge of attackers in this paper. We assume that the attacker has access to the published data set T and he knows that T is a randomized version of some base table T. The attacker knows the domain of each attribute of T. We also assume that the attacker can obtain the QI-values of the target individual (e.g., Alice) from some public database or background knowledge and knows that the target individual is definitely contained in the published data.However, he has no knowledge of which record in the published data belongs to the target individual. Finally, we assume that the distortion matrices P_{i} are available to the attacker, because they are necessary for data miners to conduct analysis. The algorithm of l-diversity in [2] preserves privacy by generalizing the QI attributes to form QI-groups. Individuals in the group are linked to any sensitive attributes with probability at most 1/l, i.e.,Pr(S_{r}'QI_{r}) ≤ 1/l . However, the randomization based approach achieves the privacy protection probabilistically. In the following subsection, we show how to quantify the attribute disclosure risk in the randomization settings.
When there is no randomization applied, for those records with their quasi-identifiers equal to QI_{r}, the attacker simply regards every record as having the same probability to represent individual r. The risk of sensitive attribute disclosure is equal to
because there are π_{QIr} records within the group of QI_{r}, and π_{QIr, Sr} of them have the sensitive value equal to S_{r}. This case corresponds to the worst case of the attribute disclosure risk. When randomization is applied, the attribute disclosure risk will be reduced, because the randomization increases the attacker’s uncertainty.
1) Randomize S only (RR-S): When data owners only apply randomization to the sensitive attribute, for each record within the group of QIr, the attacker takes a guess on its sensitive value using the observed sensitive value and the posterior probability in (4). According to (5), the probability of a correct estimation is Pr( = ？r'QIr), then the risk of sensitive attribute disclosure is
2) Randomize QI only (RR-QI): Similarly as RR-S, when data owners only apply randomization to the quasi-identifiers,the probability of correctly reconstructing QIr is given by Pr(Xr= QIr), and hence the risk of sensitive attribute disclosure is
3) Randomize QI and S (RR-Both): When data owners apply randomization to both QI and S, the attacker first needs to ensure the values of identifier attributes are correctly reconstructed.The probability is given by Pr(QIr = QIr). Second, the attacker needs to ensure the value of the sensitive attribute,given the correctly reconstructed identifier attribute values are correctly reconstructed. We summarize the risk of sensitive attribute disclosure in RR-Both , as well as RR-S and RR-QI, in the following result and give the general calculation of the attribute disclosure risk in the randomization settings.
RESULT 1. Assume an individual r has quasi-identifier QI_{r} =α = {i_{1}, i_{2}, . . . , i_{m}}, i_{k} ∈ Ω_{k}, and his sensitive attribute S_{r} = u, u∈ Ω_{s}. The probability of successfully predicting the sensitive attribute value S_{r} of the target individual r given his quasi-identifier values QI_{r} is:
We give the formal expressions of the two reconstruction probabilities needed in calculating Pr(S_{r}'QI_{r}) in (6). The reconstruction probability of the quasi-identifier is given by:
where Pr(QI^{~}_{r} = β'QI_{r} = α) =
is the probability that α ={i_{1}, i_{2}, . . . , i_{m}} is distorted to β = {j_{1}, j_{2}, . . . , j_{m}}.
The reconstruction probability of the sensitive attribute S for the target individual with the quasi-identifier QI_{r} is given by:
We are interested in when the attribute disclosure reaches the minimum. We show our results in the following property and include our proof in the Appendix.
PROPERTY 1. Given QI_{r} of individual r:
for RR-S, Pr(S_{r}'QI_{r}) is minimized when
for RR-QI, Pr(S_{r}'QI_{r}) is minimized when
for RR-Both , Pr(S_{r}'QI_{r}) is minimized when
Example. We use the instance shown in Table 1b to illustrate this property. For an individual r with (QI_{r} = Female, S_{r} = Cancer),we randomize S (Disease) with p_{s} and QI (Gender) with p_{G} independently. Fig. 2 shows how the attribute disclosure is varied when we apply different randomization parameters. We can see that Pr(S_{r}'QI_{r}) reaches the maximum (i.e.,
when no randomization is introduced. Fig. 2a shows the scenario when
we only randomize S. We can see that min
when P_{s} = 1/d_{s} = 1/3_{s}
Fig. 2b shows the scenario when we only randomize QI. We can see that min Pr(S_{r}'QI_{r}) = π_{10} = 0.12 when p_{G} = 1/d_{c} = 1/2
Fig. 2c shows the case where randomization is applied to both QI and S. Pr(S_{r}'QI_{r}) reaches the minimum only when both p_{s} =1/3 and p_{G} = 1/2 and min
Computational Cost.The main computation cost in (6)comes from calculating Pr(QI_{r} = QI_{r}). Let P_{QI} be the distortion matrix on quasi-identifiers: P_{QI }= P_{1} ？ · · · ？ P_{m}, π_{QI} be the contingency table on all quasi-identifiers arranged in a column vector,and λ_{QI} denote the expected QI contingency table of the randomized data: λ_{QI} = P_{QI} π_{QI}. Then, the denominator in (7) is exactly the cell of λ_{QI} corresponding to β. Let η denote the column vector of the reconstruction probabilities of quasi-identifiers,arranged in the same order of π_{QI}. We can further express(7) in matrix form:
where × denotes the component-wise multiplication, P_{i}^{2} is the component-wise square of P_{i}, and λ_{QI}^{-1} is the component-wise inverse of λ_{QI}.
In (9), we need to repeatedly calculate (P_{1} ？ · · · ？ P_{m})x,where x denotes a column vector. Assume we use the naive algorithm in all matrix multiplications. Calculating (P_{1} ？ · · · ？ P_{m})x directly results in the time and storage complexity of O(π_{i}d_{i}^{2}). The main storage complexity is from storing matrix P_{QI}. The following lemma allows us to reduce the cost of such computation:
LEMMA 2. Let A, B and X be the matrices of size n × n, m × m,and m × n. Then
(A ？ B) vec(X) = vec(BXA^{T}),
where vec(X) denotes the vectorization of the matrix X formed by stacking the columns of X into a single column vector.
Applying Lemma 2 recursively, we can reduce the time complexity to O([Σ_{i}d_{i}] π_{i}d_{i}). The storage complexity is also reduced to O(π_{i}d_{i} + Σ_{i}d_{i}^{2}); this is mainly used to store the
contingency table.
The ultimate goal of publishing data is to maximize utility,while the minimizing risk of attribute disclosure simultaneously.The utility of any data set, whether randomized or not,is innately dependent on the tasks that one may perform on it.Without a workload context, it is difficult to say whether a data set is useful or not. This motivates us to focus on the data distribution when evaluating the utility of a database, since many data mining applications are based on the probability distribution of the data.
where E[d(.)] denotes the expectation of d(.) and d(X?, π)denotes a certain distance between X and π.
We can set the privacy threshold formalize as the same privacy requirement of l-diversity (i.e., τ = 1/l ) to compare the performance of different disguised schemes. That is, we would determine the optimal randomization parameters (p_{1}, p_{2}, . . . , p_{m},and p_{s}) to maximize the utility, while ensuring that the sensitive value of any individual involved in the data set cannot be correctly inferred by an attacker with probability more than 1/l. A larger l leads to stronger privacy protection. In general, privacy constraints may be flexible. For example, different individuals may have different concerns about their privacy, so we can set different thresholds for Pr(S_{r}'QI_{r}).
Problem 1 is a nonlinear optimization problem. In general,we can solve it using optimization packages (e.g., trust region algorithm [40]). In Section IV-A, we discussed how to efficiently calculate the attribute disclosure of target individuals(shown as constraints in PROBLEM 1). Next, we show how we can efficiently calculate E[∥X-π_{2}^{2}∥], the expected Euclidean distance difference between the original data and the estimated one.
RESULT 2. When d(X, π) is the squared Euclidean distance,Problem 1 is equivalent to: determine p_{i}, i = 1, · · · ,m, and p_{s} to
squared Euclidean distance, with Lemma (3) shown in the Appendix, to minimize d(X, π) is equivalent to minimize trace(Σ) where Σ is the covariance matrix of the estimator of cell values in the contingency table (shown in Equation 2). Calculating trace(Σ) still involves high computational cost. However,when P_{i} has the specific form shown in (1), we can further reduce the problem to minimizing π_{i} ┃P_{i}^{-1}∥_{F}^{2}, and with Lemma 4,we have
where 1 is the column vector whose cells are all equal to 1.
We ran our experiments on the Adult Database from the UCI data mining repository [41] in our evaluations. The same database has been used in previous work on k-anonymity, l-diversity,t-closeness, and anatomy [2-5]. The Adult Database contains 45,222 tuples from US census data and 14 attributes. Table 2 is a summary description of the data including the attributes we used, the number of distinct values for each attribute, and the types of the attributes adopted in the evaluation.
It is expected that a good publication method should preserve both privacy and data utility. We set different l values as privacy disclosure thresholds. We adopt the following standard distance measures to compare the difference of distributions between the original and reconstructed data to quantify the utility. Given two distributions P = (p_{1}, p_{2}, ..., p_{m}), Q = (q_{1}, q_{2}, ..., q_{m}), KL distance is defined as
and χ^{2}-distance is d_{x2}(P, Q)=
In our evaluation, we first investigate utility vs. privacy of the randomization method in protecting attribute disclosure on two aspects: 1) compare data utility among different scenarios,and 2) the impact of cardinality upon data utility. Then, we compare randomization with l-diversity and anatomy.
We treated Education, Martial-status, Gender, and Race as the quasi-identifier and used Workclass as the sensitive attribute. We term this data set EMGRW. We distort only QI attributes, or sensitive attribute S, or both in different application scenarios for randomization. Table 3 shows the derived randomization parameter p for three scenarios (RR-QI, RR-S, and RR-Both). We set l = 2, 3, 4, 5. We can observe that more distortions(p is away from 1) are needed to achieve better privacy protections (when l is increased). We can also observe that p_{i} for QI attributes in RR-Both is generally closer to 1 than that in RR-QI,because RR-Both could also distort the sensitive attribute in addition to those QI attributes to achieve some given privacy protection. Thus, a small magnitude of distortion is needed for QI attributes in RR-Both.
Fig. 3 shows results on various distance measures. Naturally,there is a tradeoff between minimizing utility loss and maximizing privacy protection. Fig. 3 indicates that the smaller the distance values, the smaller the difference between the distorted and the original databases, and the better the utility of the distorted database. We can observe that the utility loss (in terms of distance differences) increases approximately linearly with the increasing privacy protection across all three randomization scenarios. RR-Both achieves the best in terms of utility preservation,because we use the optimal randomization parameters for both QI and the sensitive attribute.
As discussed in Section III, one advantage of the randomization scheme is that the more data we have, the more accurate reconstruction we can achieve. We generate four more data sets with varied sizes by sampling r * N tuples from the Adult Data randomly where N is the size of the original Adult data set and we set r ∈ [0.5, 1.5, 2, 2.5] to investigate the impact of data size upon the data utility of the randomized data. All four generated data sets have the exact same distribution as the original one.Fig. 4 shows the accuracy of reconstructed data distribution when data size increases. We see that data utility is further improved when more data are available.
We chose Education, Salary, Gender, Race as QI, and Occupation as the sensitive attribute, similar to the settings of empirical evaluations in [2, 5] to compare randomization scheme with l-diversity and anatomy. We term this data set ESGRO. We did not use the previous EMGRW data set, because l-diverse partitions cannot be derived by the anatomy algorithm or entropy ldiversity.As specified in Machanavajjhala et al. [2] and Xiao and Tao [5], an l-diverse partition exists, if and only if at most N/l records are associated with the same sensitive value, where N is the cardinality of the data set. The data distribution of attribute Work-class is much skewed and the frequency of one value is much larger than the others. In this section, we focus on the use of RR-QI to compare to those two group based models:
l-diversity and anatomy, because the overall distribution of sensitive attribute values is unchanged in both group based methods after anonymization and RR-QI after randomization.
Generalization approaches usually measure the utility syntactically by the number of generalization steps applied to quasi-identifiers [42], average size of quasi-identifier equivalence classes [2], sum of squares of class sizes, or preservation of marginals [3]. We further examine the utility preservation from the perspective of answering aggregate queries, in addition to the previous distribution distance measures, since the randomization scheme is not based on generalization or suppression.We adopt query answering accuracy; the same metric is used in Zhang et al. [4]. We also consider the variation of correlations between the sensitive attribute and quasi-identifiers. We used an implementation of the Incognito algorithm [3] to generate the entropy l-diverse tables and used the anatomy algorithm in [5] in our experiments.
1) Distribution Distance: We compare data utility of reconstructed probability distributions for different models according to the distance measures. Fig. 5 shows distances between X and π for anatomy, l-diversity and RR-QI on the data set ESGRO.We can observe that randomization outperforms both anatomy and l-diversity methods, because we can partially recover the original data distribution in the randomization scheme, whereas data distribution within each generalized equivalence class is lost in l-diversity generalization and the relations between the quasi-identifier table (QIT) and the sensitive table (ST) are also lost in anatomy.
Another observation is that data utility (in terms of distance between original and reconstructed distributions) monotonically decreases with the increment of the privacy thresholds (l) for randomization and anatomy. This is naturally expected, because more randomness needs to be introduced with the increment of the privacy requirements for randomization and larger l in anatomy means that more tuples are included in each group, which decreases the accuracy of the estimate for the distribution of the original data. However, there is no similar monotonic trend in l-diversity,because the generalization algorithm chooses different attributes for generalization with various l values; this makes the accuracy of the estimated distribution vary to different extents.
2) Query Answering Accuracy:The accuracy of answering aggregate queries is one of the important aspects to evaluate the utility of distorted data. We compare randomization against two other anonymization approaches [3, 5] using the average relative error in the returned count values. For each count value(corresponding to the number of records in a group), its relative error is defined as |act - est|/act, where act is the actual count value from the original data, and est is the estimate from the reconstructed data for RR approaches or the anonymized data for anonymization approaches. We consider two types of queries in our evaluation.
- Base group-by query with the form:SELECT A1,. . . , Am, S, COUNT(*) FROM data
WHERE A1 = i1 . . . Am = im AND S = is GROUP BY (A1, · · · ,
Am, S)
Where ik ∈ Ωk and is ∈ Ωs.
- Cube query with the form:
SELECT A1,· · · , Am, S, COUNT(*) FROM data
GROUP BY CUBE (A1, · · · , Am, S)
The base group-by query returns the number of records in each group of (A_{1} = i_{1} . . . A_{m} = i_{m} and S = i_{s}). Note that the total number of groups is equal D = d_{s} Π_{i-1}^{m} d_{i}).
After running the base group-by query on the original data and the reconstructed data or the anonymized data, we get the count values of act and est, respectively. Fig. 6a shows the calculated average relative errors of three methods (anatomy, l-diversity,and RR-QI) with varied l (l = 1, · · · , 7). We use the CUBE query to describe all the possible hierarchical aggregate queries (In multidimensional jargon, a cube is a cross-tabulated summary of detail rows. CUBE enables a SELECT statement to calculate subtotals for all possible combinations of a group of dimensions. It also calculates a grand total.). The CUBE query returns aggregate values of all possible combinations of QI attributes. We calculate the average relative error for each l = 1,· · · , 7 and show the results in Fig. 6b. We can observe from Fig. 6a and 6b that randomization permits significantly more accurate aggregate analysis than both l-diversity and anatomy, since it can recover more accurate data distribution. Conversely, l-diversity loses the data distribution of records within each generalized QI-group and anatomy loses correlations between the QI attributes and the sensitive attribute S.
3) Correlation Between Attributes: A good publishing method should also preserve data correlation (especially between QI and sensitive attributes). We use Uncertainty Coefficient(U) to evaluate the correlation between two multi-category variables:
The uncertainty coefficient takes values between -1 and 1;larger values represent a strong association between variables.When the response variable has several possible categorizations,these measures tend to take smaller values, as the number of categories increases.
Table. 4 and 5 show correlation (uncertainty coefficient) val-
ues between each pair of attributes vary under three models(anatomy, l-diversity and RR-QI) on the data set ESGRO. We vary l from 2 to 7. We only include correlation values for l = 3,4, and 5 due to the space limitation. We use the attribute pair of Salary (S, one QI attribute) and Occupation (O, the sensitive attribute) as an example (the column with bold fonts in Table 5)to study how correlations between QI and S change. The original uncertainty coefficient is 2.74 × 10^{-2}. RR-QI well achieves correlation preservation, i.e., 2.41, 2.27, and 2.17 (×10^{-2}) for l =3, 4, and 5 respectively. Conversely, the uncertainty coefficient value under anatomy is only 0.20, 0.12, and 0.05 (×10^{-2}) correspondingly.For l-diversity, it achieves zero correlation preservation when l = 3, and 5 while it perfectly achieves correlation preservation when l = 4, because the Salary attribute is generalized to “All” when l = 3, and 5, while it is unchanged across all QI-groups when l = 4. l-diversity in general cannot preserve correlation well, because it is intractable to predict which QI attributes will be generalized in l-diversity.
Fig. 7 shows the average value of uncertainty coefficients among attributes for anatomy, l-diversity and RR-Both on the data set ESGRO. As shown in Fig. 7a, randomization achieves the best correlation preservation between the sensitive attribute and quasiidentifiers across all privacy thresholds. Fig. 7b also clearly shows that randomization better preserves correlation among quasi-identifiers than l-diversity does. Please note that anatomy can best achieve the correlation among quasi-identifiers,since it does not change values of quasi-identifiers.
In summary, the evaluation shows that randomization can better preserve utility (in terms of distribution reconstruction,accuracy of aggregate query answering, and correlation among attributes) under the same privacy requirements. Utility loss is significantly smaller than that of generalization or anatomy approaches. Furthermore, the effectiveness of randomization can be further improved when more data are available. The evaluation also showed that the randomization approach can further improve the accuracy of the reconstructed distribution(and hence utility preservation) when more data are available,while generalization and anatomy approaches do not have this property.
In this paper, we investigated attribute disclosure in the case of linking attacks. We compared randomization to other anonymization schemes (l-diversity and anatomy) in terms of utility preservation, with the same privacy protection requirements.Our experimental evaluations showed randomization significantly outperforms generalization, i.e., achieving better utility preservation, while yielding the same privacy protection.
There are several other avenues for future work. We aim to extend our research to handle multiple sensitive attributes. In this paper, we limit our scope, as attackers have no knowledge about the sensitive attribute of specific individuals in the population and/or the table. In practice, this may not be true, since the attacker may have instance-level background knowledge(e.g., the attacker might know that the target individual does not have cancer; or the attacker might know complete information about some people in the table other than the target individual.)or partial demographic background knowledge about the distribution of sensitive and insensitive attributes in the population.Different forms of background knowledge have been studied in privacy preservation data publishing recently. For example, the formal language [22, 24] and maximum estimation constraints[43] are proposed to express background knowledge of attackers.We will investigate privacy disclosure under those background knowledge attacks. We should note that randomization may not outperform generalization in all cases, especially when some specific background knowledge is utilized in the adversary model, because generalization is truthful and the ambiguity is between true tuples (e.g., individual a and b are indistinguishable),whereas randomization can be regarded as adding noise,and noise can be removed when some background knowledge is known.
We will continue our study of further reducing computation overhead of the randomization approach. The execution time of randomization is usually slower than l-diversity and anatomy.For example, the execution time of our RR-Both approach on the Adult data sets was 10 times slower than for generalization approaches due to the heavy cost of determining the optimal distortion parameters with attribute disclosure constraints, especially when the attribute domains are large. One advantage of randomization is that the reconstruction accuracy increases when more data are available. Guo et al. [10] preliminarily conducted theoretical analysis on how the randomization process affects the accuracy of various measures (e.g., support, confidence,and lift) in market basket data analysis. In our future work, we will study the accuracy of reconstruction in terms of bias and variance of estimates in randomizing general microdata.
Finally, we are interested in comparing our randomization approaches to most recently developed generalization approaches(e.g., the accuracy-constrained l-diversification method in [14]).We are also interested in combining the derived optimal randomization schemes with the small domain randomization [38]to further improve utility preservation. It is our belief that extensive studies on comparing different privacy preservation data publishing approaches are crucial.
We start with applying RR to the sensitive attribute, Pr( r =QI^{^}_{r}) = 1. Without loss of generality, we assume that S_{r} = 1. Combine other categories 0, 2, . . . , d_{s} - 1 into a new categories, and still use 0 to denote the new category. To make the notation simple,we simply write
as π_{0} in this proof, then π_{1} =1 ？ π_{0}. Such adjustment does not change the reconstruction probability Pr( S？_{r} = S_{r} = 1'QI_{r}). After the adjustment, the randomization probabilities are given by
By definition, the posterior probabilities are given by
Combining (13), (14), and (15), we have
Taking the derivative with respect to ps, we have (16) is minimized when p_{s} = 1/d_{s} , and the minimal value is
Following similar strategies, we can prove the general case when we randomize both QI and S.
LEMMA 3. If d( π^{^} , π) = '' π^{^} π''_{2}^{2} , we have E[d( π^{^}, π)] = trace(Σ), where Σ is the covariance matrix of π^{^} shown in (2).
PROOF. We know that π_{^} asymptotically follows the normal distribution N(π, Σ). Let Σ = XΛX^{T} be the eigen-decomposition of Σ, where Λ = diag(λ_{1}, . . . , λ_{n}) and X^{T}X = I. Let η = X^{T} (π^{^} π), then η is normally distributed with E(η) = 0 and
Cov(η) = X^{T} Cov( π^{^} π)X = X^{T}ΣX = Λ.
Notice that Λ is a diagonal matrix, and hence Cov(η_{i}η_{j}) = 0 if i ≠_{ j}, and Var(η_{i}) = λ_{i}, i.e., η_{i} independently follows the normal distribution N(0, λ_{i}). Therefore, we have:
The last equality is due to the fact that
trace(Σ) = trace(XΛXT) = trace(ΛXTX) = trace(Λ).
We proved the lemma.
LEMMA 4. Let P_{i }be the randomization matrix specified in(1). When p_{i} ≠ 1/d_{i} , we have
where 1 is the column vector, whose cells all equal 1.
PROOF. We can re-write Pi as follows:
With
and the binomial inverse theorem[45], we can immediately get P_{i}^{-1}, and ''P_{i}^{-1 }_{F} ''^{2} can be directly calculated from P_{i}^{-1} .
LEMMA 5. Let (P^{-1})_{i} denote the i-th column (or row) of P^{-1}.Then, for any i = 1, 2, . . . , D, we have
PROOF. With Lemma 4, we observe that every row (or column)of P_{i}^{-1}has the same components, except for a change of order. Since P^{-1} = P_{i}^{-1} ？ · · · ？ P_{s}^{-1} , we can also conclude that all rows (or columns) of P^{-1} have the same components,except for a change of order. Therefore, for all i’s,
Next, we prove the main result. With Lemma 3, when the distance is the squared Euclidean distance, to minimize E[d(π？ ,π)] is equivalent to minimize trace(Σ). With (2), we have
Notice that the constraint function is an increasing function of pi, the optimal solution must occur when the equality stands,and we have proved the result.