Limiting Attribute Disclosure in Randomization Based Microdata Release
- Author: Guo Ling, Ying Xiaowei, Wu Xintao
- Published: Journal of Computing Science and Engineering, Volume 5, Issue 3, pp. 169-182, 30 June 2011
-
ABSTRACT
Privacy preserving microdata publication has received wide attention. In this paper, we investigate the randomization approach and focus on attribute disclosure under linking attacks. We give efficient solutions to determine optimal distortion parameters, such that we can maximize utility preservation while still satisfying privacy requirements. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects (reconstructed distributions, accuracy of answering queries, and preservation of correlations). Our empirical results show that randomization incurs significantly smaller utility loss.
-
KEYWORD
Algorithms, Privacy preservation, Randomization, Attribute disclosure, Linking attack
-
I. INTRODUCTION

Privacy preserving microdata publication has received wide attention [1-5]. Each record corresponds to one individual and has a number of attributes, which can be divided into the following three categories: 1) identity attributes (e.g., Name and SSN), whose values can uniquely identify an individual; 2) quasi-identifier (QI) attributes (e.g., demographic attributes such as ZIP code, age, and gender), whose values taken together can potentially identify an individual; and 3) sensitive attributes (e.g., disease and income), which indicate confidential information about individuals. Before microdata are released, identity attributes are often directly removed to preserve the privacy of the individuals whose data are in the released table. However, the
QI information may be used by attackers to link to other public data sets to obtain the private information of individuals. This is known as a linking attack in microdata publishing. Two types of information disclosure have been identified under linking attacks: identity disclosure and attribute disclosure [6]. Identity disclosure occurs if attackers can identify an individual from the released data. Attribute disclosure occurs when confidential information about an individual is revealed and can be attributed to that individual. Samarati and Sweeney [1] proposed the k-anonymity model and presented a generalization approach that divides tuples into QI-groups by transforming their QI-values into less specific forms, such that tuples in the same QI-group cannot be uniquely identified by attackers, to counter linking attacks based on quasi-identifiers. It was identified that k-anonymity is vulnerable to homogeneity and background knowledge attacks when the data in a QI-group lack diversity in the sensitive attributes [2]. l-diversity [2], as well as subsequent models (e.g., t-closeness [7]), were proposed to protect against attribute disclosure. l-diversity requires that the sensitive attribute has at least l well-represented values in each QI-group of the generalized dataset. The randomization approach has also been adopted to publish microdata [8-12]. Instead of generalizing
QI attribute values, the randomization approach distorts an original value to another domain value according to some distortion probabilities. The application of the randomization technique to preventing identity disclosure under linking attacks in data publishing was studied in [12], which focused on evaluating the risk of successfully linking a target individual to the index of his record given the values of the QI attributes. Our research moves one step further to investigate attribute disclosure under linking attacks. We focus on evaluating the risk of successfully predicting the sensitive attribute value of a target individual given his QI attribute values. We present a general randomization framework and give efficient solutions to determine optimal randomization parameters for both QI and sensitive attributes. Thus, we can maximize data utility while satisfying the privacy preservation requirements for sensitive attributes. We compare our randomization approach with other anonymization approaches within this framework (two representative approaches, l-diversity [2] and anatomy [5], are used in this paper). Our results show that the randomization approach can better recover the distribution of the original data set from the disguised one. Thus, intuitively, it might yield a disguised database with higher data utility than l-diversity generalization and anatomy.

Our contributions are summarized as follows. We present a systematic study of the randomization method in preventing attribute disclosure under linking attacks. We propose the use of a specific randomization model and present an efficient solution to derive distortion parameters that satisfy the privacy preservation requirements while maximizing data utility. We propose a general framework and present a uniform definition of attribute disclosure that is compatible with both randomization and generalization models. We compare our randomization approach with
l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects (reconstructed distributions, accuracy of answering queries, and preservation of correlations). Our empirical results show that randomization incurs significantly smaller utility loss.

The remainder of this paper is organized as follows. In Section II, we discuss closely related work on group based anonymization approaches and randomization approaches in privacy preservation data publishing. In Section III, we present background on randomization based distortions, including analysis of and attacks on the randomized data. In Section IV, we first quantify attribute disclosure under linking attacks and then show our theoretical results on maximizing utility under privacy constraints. In Section V, we conduct empirical evaluations and compare the randomization based distortion with two representative group based anonymization approaches (l-diversity [2] and anatomy [5]). We conclude our work in Section VI.

II. RELATED WORK

> A. Group Based Anonymization
k-anonymity was proposed [1] to counter linking attacks in data publishing. The method generalizes the values of quasi-identifier attributes to less-specific ones, so that each individual cannot be distinguished from at least k − 1 other individuals based on quasi-identifier information. There has been much work on designing efficient algorithms for k-anonymity using generalization and suppression techniques [2, 3, 5, 13, 14]. k-anonymity provides protection against identity disclosure, but it contributes little against attribute disclosure. l-diversity [2] was proposed to counter attribute disclosure. A table is l-diverse if, in each QI-group, at most 1/l of the tuples possess the most frequent sensitive value. Thus, an individual can be linked to his sensitive value correctly with probability at most 1/l. Similarly, the notion of t-closeness [7] requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold t). The anatomy method was proposed to protect against attribute disclosure by breaking the link between quasi-identifiers and sensitive attributes [5]. Anatomy releases all the quasi-identifier and sensitive values directly in two separate tables: a quasi-identifier table (QIT) and a sensitive table (ST). The authors presented an algorithm to compute anatomized tables that satisfy the l-diversity requirement while minimizing the error of reconstructing the data. Kifer and Gehrke [15] investigated the problem of injecting additional information into k-anonymous and
l-diverse tables to improve the utility of published data. Koudas et al. [16] studied the problem of preserving privacy through sensitive attribute permutation and generalization, with the addition of fake sensitive attribute values. Narayanan and Shmatikov [17] presented a de-anonymization methodology for sparse multi-dimensional microdata, such as individual transactions, preferences, and so on. The de-anonymization algorithm could be used by attackers against data sets containing anonymized high-dimensional data. Brickell and Shmatikov [18] compared the privacy loss (defined by certain kinds of information learned by attackers) with the utility gain (defined as the same kinds of information learned by analysts) caused by data anonymization, and concluded that "even modest privacy gains require almost complete destruction of the data mining utility" in generalization methods. Li and Li [19] proposed the use of the worst-case privacy loss, rather than the average privacy loss among all individuals adopted in [18], to measure privacy loss. They further presented a methodology to evaluate the privacy-utility tradeoff, where privacy loss is quantified by the adversary's knowledge gain about the sensitive values of specific individuals with the trivially-anonymized data as the baseline, and utility loss is measured by the information loss about the sensitive values of large populations with the original data as the baseline. In our paper, we use 1/l to bound the attribute disclosure for all records, and use the difference between the distributions of the original data and the reconstructed data to quantify utility loss. We also examine utility loss from the perspectives of answering aggregate queries and preserving correlations between the sensitive attribute and the QI attributes. In this paper, we focus on the randomization model in resisting attribute disclosure in data publishing. The proposed randomization model is compared to the traditional
l-diversity [2] and anatomy [5] models. Recently, Ghinita et al. [14] proposed a framework to solve the privacy-constrained and accuracy-constrained data anonymization problems under the k-anonymity and l-diversity models. They developed efficient heuristics to solve the one-dimensional problems in linear time and generalized the solutions to multi-dimensional attributes using space-mapping techniques. Their empirical evaluation showed that the privacy/accuracy-constrained methods outperform existing generalization approaches in terms of execution time and information loss. We would also point out that researchers have developed various anonymity models, including p-sensitive k-anonymity [20], (α, k)-anonymity [21], (k, e)-anonymity [4], (c, k)-safety [22], m-confidentiality [23], privacy skyline [24], δ-presence [25], (ε, m)-anonymity [26], and (t, λ)-uniqueness [27]. Detailed comparisons to these works are beyond the scope of this paper.

> B. Randomization

The first randomization model, called Randomized Response (RR), was proposed by Warner [28] in 1965. The model deals with one dichotomous attribute, i.e., every person in the population belongs either to a sensitive group A or to its complement Ā. The problem is to estimate π_A, the unknown proportion of population members in group A. Each respondent is provided with a randomization device by which the respondent chooses one of the following two questions, "Do you belong to A?" or "Do you belong to Ā?", with respective probabilities p and 1 − p, and then replies "yes" or "no" to the question chosen. The technique provides response confidentiality and increases respondents' willingness to answer sensitive questions, because no one but the respondent knows which question the answer pertains to. It has been extended to the field of privacy preserving data mining [29]. Aggarwal and Yu [30] provided a survey of randomization models for privacy preservation data mining, including various reconstruction methods for randomization, adversarial attacks, and the optimality and utility of randomization. Rizvi and Haritsa [9] developed the MASK technique to preserve privacy in frequent item set mining. Agrawal and Haritsa [31] presented a general framework for random perturbation in privacy preservation data mining. Du and Zhan [8] studied the use of a randomized response technique to build decision tree classifiers. Guo et al. [10] investigated data utility in terms of the accuracy of reconstructed measures in privacy preservation market basket data analysis. Aggarwal [32] proposed adding noise from a specified distribution to produce a probabilistic model of k-anonymity. Zhu and Liu [33] investigated the construction of optimal randomization schemes for privacy preservation density estimation and proposed a general framework for randomization using mixture models. Rebollo-Monedero et al. [34] defined a privacy measure similar to t-closeness [7] and formalized the privacy-utility tradeoff from an information theory perspective. The resulting solution turns out to be the post-randomization (PRAM) method [35] in the discrete case and a form of random noise addition in the general case. They also proved that, in the continuous case, the optimal perturbation is in general randomized rather than deterministic.
Huang and Du [36] studied the search for optimal distortion parameters to balance privacy and utility. They developed an evolutionary multi-objective optimization method to find optimal distortion matrices when applying the RR technique to a single attribute. The complexity of the algorithm is O(((N_Q + N_V)^3 + n^2) L), where N_Q is the population size, N_V is the archive size, n is the number of columns in the RR matrix, and L is the maximum number of iterations. However, there is no guarantee that the derived matrices are optimal in the entire search space, and the performance tends to suffer for multiple attributes due to the intractable size of the search space. Similarly, Xiao et al. [37] investigated optimal random perturbation at multiple privacy levels. In our paper, we focus on a specific randomization model (i.e., perturbation retains the original value of an attribute
A_i with probability p_i and replaces the original value with a random value from the domain of A_i; see Equation 1) and derive an efficient solution to determine the optimal randomization parameters under linking attacks. Chaytor and Wang [38] also applied this simple randomization model, randomizing a sensitive value only within a subset of the entire domain. Their algorithm, called Small Domain Randomization, can effectively derive optimal partitions, such that randomizing the sensitive attribute values of the records (with small domains) in each partition separately retains more data utility, at the same level of privacy, than randomizing the whole data set over the entire domain. Our work focuses on attribute disclosure when applying full domain randomization to both the QI attributes and the sensitive attribute S. The derived optimal randomization parameter values under privacy constraints can be incorporated into small domain randomization.

III. RANDOMIZATION BASED DISTORTION

Dataset
T contains N records and m + 1 categorical attributes: A_1, A_2, ..., A_m, and S. We use QI = {A_1, ..., A_m} to denote the set of quasi-identifier attributes (e.g., demographic attributes) whose values may be known to the attacker for a given individual, and we use S to denote one sensitive attribute whose value should not be associated with an individual by attackers. Generally, T may also contain other attributes that are neither sensitive nor quasi-identifying. Those attributes are usually kept unchanged in the released data. We exclude them from our setting, since they incur neither privacy disclosure risk nor utility loss. All of the discussions in this paper are presented in the single sensitive attribute setting and can be generalized to multiple sensitive attributes.

Attribute
A_i has d_i categories, denoted 0, 1, ..., d_i − 1. We use Ω_i to denote the domain of A_i, Ω_i = {0, 1, ..., d_i − 1}, and Ω_QI = Ω_1 × ··· × Ω_m is the domain of the quasi-identifiers. Similarly, attribute S has d_s categories, denoted 0, 1, ..., d_s − 1. We use Ω_s = {0, 1, ..., d_s − 1} to denote the domain of S. The r-th record R_r is denoted by (A_1r, A_2r, ..., A_mr, S_r), or simply (QI_r, S_r). Let D = d_s ∏_{i=1}^m d_i be the total number of cells in the contingency table. Let π_{i_1, ..., i_m, i_s} denote the true proportion corresponding to the categorical combination (A_1 = i_1, ..., A_m = i_m, S = i_s). Let π be the column vector with D elements π_{i_1, ..., i_m, i_s} arranged in a fixed order. Table 1 shows one contingency table example for a data set with one QI attribute (Gender, d_1 = 2) and one sensitive attribute S (Disease, d_s = 3). Table 1a shows the original contingency table, where π = (π_00, π_01, π_02, π_10, π_11, π_12)' corresponds to a fixed order of the cell entries π_ij in the 2 × 3 contingency table. π_10 denotes the proportion of records with Female and Cancer. The column sum π_+0 represents the proportion of records with Cancer across both genders. Note that contingency tables are widely used in statistics to record and analyze the relationship between categorical attributes. The results of most data mining tasks (e.g., clustering, decision tree learning, and association rule mining), as well as aggregate queries, are solely determined by the contingency table. That is, analysis on contingency tables is equivalent to analysis on the original categorical data. Table 1b shows one contingency table instance derived from a data set with 100 tuples. We will use this contingency table as an example to illustrate properties of link disclosure.
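To make the notation concrete, the following Python sketch builds such a contingency table and its vectorization π. The counts are hypothetical, except that the Female/Cancer cell matches the value π_10 = 0.12 used in the Section IV example; the disease labels other than Cancer are also placeholders.

import numpy as np

# Hypothetical 2 x 3 instance of Table 1: rows = Gender (Male, Female),
# columns = Disease (Cancer plus two placeholder diseases).
counts = np.array([[30, 25, 15],    # Male   (hypothetical counts)
                   [12, 10,  8]])   # Female (12/100 = 0.12 matches pi_10)
N = counts.sum()                    # 100 tuples, as in Table 1b

# pi: cell proportions in the fixed order (pi_00, pi_01, pi_02, pi_10, pi_11, pi_12)'
pi = (counts / N).reshape(-1)
print(pi[3])                        # 0.12, the (Female, Cancer) cell

# Column sums give marginals such as pi_+0, the overall proportion with Cancer.
print(counts.sum(axis=0) / N)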
We use the left part of Fig. 1 to illustrate the process of privacy preserving data publishing. For each record R_j, the data owner independently randomizes each attribute A_i using the distortion matrix P_i. Specifically, for attribute A_i (or S) with d_i categories, the randomization process changes a record belonging to the v-th category (v = 0, ..., d_i − 1) to the u-th category with probability p_uv^(i): Pr(Ã_i = u | A_i = v) = p_uv^(i). Let P_i = [p_uv^(i)]_{d_i × d_i}; we call P_i (or P_s) the distortion matrix for A_i (or S). Naturally, the column sums of P_i are equal to 1. The original database T is changed to T̃, and then both the randomized data set and the distortion matrices are published. The randomization matrices indicate the magnitude of the randomization; this can help data analysts estimate the original data distribution.

Let λ denote the contingency table of the randomized data
T̃. We arrange λ into a column vector in the same order as π. Table 1c shows one example of a randomized contingency table. The randomized contingency table has a close relationship to the original contingency table and the randomization matrices:

E(λ) = P π,  where P = P_1 ⊗ ··· ⊗ P_m ⊗ P_s,

and ⊗ stands for the Kronecker product. The Kronecker product is an operation on two matrices, an m-by-n matrix A and a p-by-q matrix B, resulting in an mp-by-nq block matrix. The distortion matrices determine the privacy and utility of the randomized data. How to find optimal distortion parameters under privacy or utility constraints has remained a challenging problem [39]. In this paper, we limit the choice of randomization parameters for each QI attribute A_i (and the sensitive attribute S) as:

p_uv^(i) = p_i if u = v,  and  p_uv^(i) = q_i = (1 − p_i)/(d_i − 1) if u ≠ v.   (1)

That is, for each attribute A_i, all categories have the same probability p_i of remaining unchanged, and the same probability q_i of being distorted to any given different category. With this restricted choice, we can derive an efficient algorithm (with an explicit formula) to determine the optimal randomization parameters (as shown in Section IV-B).
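As an illustration, here is a minimal Python sketch of the distortion matrix in Equation (1) and of randomizing one categorical column with it; the domain size and parameter value are arbitrary choices for the example.

import numpy as np

rng = np.random.default_rng(0)

def distortion_matrix(d, p):
    # Equation (1): keep a category with probability p; move to each of the
    # other d - 1 categories with probability q = (1 - p) / (d - 1).
    q = (1.0 - p) / (d - 1)
    return (p - q) * np.eye(d) + q * np.ones((d, d))   # columns sum to 1

def randomize_column(values, P):
    # Independently distort each value v to u with probability P[u, v].
    d = P.shape[0]
    return np.array([rng.choice(d, p=P[:, v]) for v in values])

P_s = distortion_matrix(3, 0.6)          # e.g., Disease with d_s = 3, p_s = 0.6
disease = rng.integers(0, 3, size=1000)  # synthetic original column
disease_tilde = randomize_column(disease, P_s)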
> B. Analysis on the Randomized Data

One advantage of the randomization procedure is that the pure random process allows data analysts to estimate the original data distribution based on the released data and the randomization parameters. The right part of Fig. 1 shows how data analysts estimate the original data distribution. With the randomized data T̃ and its contingency table λ, the unbiased estimate of π is given by π̂ = P^{-1} λ, and the covariance matrix of the estimator is

Σ = cov(π̂) = (1/N) P^{-1} [(Pπ)_δ − (Pπ)(Pπ)^T] (P^{-1})^T,   (2)

where (Pπ)_δ stands for the diagonal matrix whose diagonal values are Pπ.
Thus, data analysts can derive an estimate of the distribution of the original data (in terms of the contingency table) without disclosing the individual information of each record. We choose the accuracy of the reconstructed distribution as the target utility function in Section IV-B, because most data mining applications are based on the probability distribution of the data.
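For a single attribute, the estimation step above can be sketched as follows: form the contingency table λ of the randomized column and solve π̂ = P^{-1} λ. The true distribution used here is arbitrary, chosen only to show that the estimate recovers it.

import numpy as np

rng = np.random.default_rng(1)

def estimate_distribution(randomized, P):
    # Unbiased estimate pi_hat = P^{-1} lambda, where lambda is the
    # contingency table (in proportions) of the randomized column.
    lam = np.bincount(randomized, minlength=len(P)) / len(randomized)
    return np.linalg.solve(P, lam)

# Distortion matrix of Equation (1) with d = 3 categories and p = 0.6.
d, p = 3, 0.6
q = (1 - p) / (d - 1)
P = (p - q) * np.eye(d) + q * np.ones((d, d))

original = rng.choice(d, size=50_000, p=[0.5, 0.3, 0.2])
randomized = np.array([rng.choice(d, p=P[:, v]) for v in original])
print(estimate_distribution(randomized, P))   # approx. [0.5, 0.3, 0.2]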
> C. Attacks on Randomized Data
Let X be an attribute or a subset of attributes in the data set T with domain Ω_X, and let X̃ be the randomized value of X in T̃. Given the randomization process and parameters, it is not reasonable for attackers to regard the observed value as the true value of X. Instead, attackers can try to estimate the original value based on the observed data and the released randomization parameters. Let X̂ denote the attacker's estimate of the original value of X. Any value in Ω_X is possible due to the randomization procedure. We assume that the attacker is able to calculate the posterior probability of content in the data set and takes the following probabilistic strategy: for any μ, ν ∈ Ω_X,

Pr(X̂ = μ | X̃ = ν) = Pr(X = μ | X̃ = ν),   (3)

where Pr(X = μ | X̃ = ν) denotes the attacker's posterior belief that the original value is X = μ when he observes X̃ = ν. With Bayes' theorem, it can be calculated by:

Pr(X = μ | X̃ = ν) = Pr(X̃ = ν | X = μ) Pr(X = μ) / Σ_{k ∈ Ω_X} Pr(X̃ = ν | X = k) Pr(X = k).   (4)

The following lemma gives the accuracy of the attacker's estimation.
LEMMA 1.
Suppose attackers adopt the probabilistic strategy specified in (3) to estimate the data. The probability that attackers accurately estimate the original value of X is given by:

Pr(X̂ = X = μ) = Σ_{ν ∈ Ω_X} Pr(X̃ = ν | X = μ) Pr(X = μ | X̃ = ν).   (5)

PROOF. For a particular observed value X̃ = ν ∈ Ω_X, Pr(X̂ = μ, X̃ = ν | X = μ) = Pr(X̃ = ν | X = μ) Pr(X = μ | X̃ = ν). Then, with the law of total probability, summing over all ν ∈ Ω_X yields (5).
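A small Python sketch of the strategy in (3)-(5): the posterior follows Bayes' theorem, and the round-trip probability of Lemma 1 weights the distortion probabilities by the posterior. The indexing convention P[u, v] = Pr(X̃ = u | X = v) matches the distortion matrices above.

import numpy as np

def posterior(P, pi):
    # post[mu, nu] = Pr(X = mu | X~ = nu), by Bayes' theorem (4).
    joint = P * pi[np.newaxis, :]          # joint[nu, mu] = Pr(X~ = nu, X = mu)
    return (joint / joint.sum(axis=1, keepdims=True)).T

def correct_estimation_prob(P, pi):
    # Lemma 1: Pr(X^ = X = mu), i.e., sum over observed nu of
    # Pr(X~ = nu | X = mu) * Pr(X = mu | X~ = nu), for every mu.
    post = posterior(P, pi)
    return np.einsum('nm,mn->m', P, post)

# With p = 1/d the posterior equals the prior, so the round trip succeeds
# only with probability pi_mu -- the minimum case of Property 1 below.
d = 3
pi = np.array([0.5, 0.3, 0.2])
print(correct_estimation_prob(np.full((d, d), 1.0 / d), pi))  # ~[0.5, 0.3, 0.2]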
The probability of the attackers' correct estimation is also defined as the reconstruction probability, Pr(X = μ → X̂ = μ), in [9, 12]. It captures the round trip of going from the original content to the distorted one and then returning to an estimate of the original content. That is, it indicates how much information about the original data is preserved after randomization. To keep the notation concise, we adopt the notation Pr(X̂ = X = μ) for the reconstruction probability used to evaluate the risk of sensitive attribute disclosure.

IV. ATTRIBUTE DISCLOSURE UNDER LINKING ATTACKS
We measure privacy in terms of the attribute disclosure risk, whose formal definition is given as follows:
DEFINITION 1.
The attribute disclosure risk under linking attacks is defined to be the probability that the attacker predicts S_r successfully given QI_r of a target individual r, denoted as Pr(S_r | QI_r).

We need to quantify the background knowledge of attackers to derive the disclosure probability Pr(S_r | QI_r). We make the following standard assumptions about the background knowledge of attackers in this paper. We assume that the attacker has access to the published data set T̃ and knows that T̃ is a randomized version of some base table T. The attacker knows the domain of each attribute of T. We also assume that the attacker can obtain the QI-values of the target individual (e.g., Alice) from some public database or background knowledge, and knows that the target individual is definitely contained in the published data. However, he has no knowledge of which record in the published data belongs to the target individual. Finally, we assume that the distortion matrices P_i are available to the attacker, because they are necessary for data miners to conduct analysis. The algorithm of l-diversity in [2] preserves privacy by generalizing the QI attributes to form QI-groups. Individuals in a group are linked to any sensitive value with probability at most 1/l, i.e., Pr(S_r | QI_r) ≤ 1/l. However, the randomization based approach achieves privacy protection probabilistically. In the following subsection, we show how to quantify the attribute disclosure risk in the randomization setting.

> A. Quantifying Attribute Disclosure
When no randomization is applied, for those records with quasi-identifiers equal to QI_r, the attacker simply regards every record as equally likely to represent individual r. The risk of sensitive attribute disclosure is equal to

Pr(S_r | QI_r) = π_{QI_r, S_r} / π_{QI_r},

because a proportion π_{QI_r} of the records fall within the group of QI_r, and a proportion π_{QI_r, S_r} of the records additionally have the sensitive value S_r. This case corresponds to the worst case of the attribute disclosure risk. When randomization is applied, the attribute disclosure risk is reduced, because randomization increases the attacker's uncertainty.

1) Randomize S only (RR-S): When data owners only apply randomization to the sensitive attribute, for each record within the group of QI_r, the attacker takes a guess at its sensitive value using the observed sensitive value and the posterior probability in (4). According to (5), the probability of a correct estimation is Pr(Ŝ_r = S_r | QI_r), so the risk of sensitive attribute disclosure is

Pr(S_r | QI_r) = Pr(Ŝ_r = S_r | QI_r).
2) Randomize QI only (RR-QI): Similarly to RR-S, when data owners only apply randomization to the quasi-identifiers, the probability of correctly reconstructing QI_r is given by Pr(Q̂I_r = QI_r), and hence the risk of sensitive attribute disclosure is

Pr(S_r | QI_r) = Pr(Q̂I_r = QI_r) · π_{QI_r, S_r} / π_{QI_r}.
3) Randomize QI and S (RR-Both): When data owners apply randomization to both QI and S, the attacker first needs the values of the quasi-identifier attributes to be correctly reconstructed; the probability of this is Pr(Q̂I_r = QI_r). Second, the attacker needs the value of the sensitive attribute to be correctly reconstructed, given correctly reconstructed quasi-identifier values. We summarize the risk of sensitive attribute disclosure in RR-Both, as well as in RR-S and RR-QI, in the following result, which gives the general calculation of the attribute disclosure risk in the randomization settings.
RESULT 1. Assume an individual r has quasi-identifier QI_r = α = {i_1, i_2, ..., i_m}, i_k ∈ Ω_k, and sensitive attribute S_r = u, u ∈ Ω_s. The probability of successfully predicting the sensitive attribute value S_r of the target individual r, given his quasi-identifier values QI_r, is:

Pr(S_r | QI_r) = Pr(Q̂I_r = QI_r) · Pr(Ŝ_r = S_r | QI_r).   (6)

We give the formal expressions of the two reconstruction probabilities needed to calculate Pr(S_r | QI_r) in (6). The reconstruction probability of the quasi-identifier is given by:

Pr(Q̂I_r = QI_r) = Σ_{β ∈ Ω_QI} [Pr(Q̃I_r = β | QI_r = α)]^2 π_α / λ_β,   (7)

where Pr(Q̃I_r = β | QI_r = α) = ∏_{k=1}^m p_{j_k i_k}^(k) is the probability that α = {i_1, i_2, ..., i_m} is distorted to β = {j_1, j_2, ..., j_m}, and λ_β is the corresponding cell of the expected randomized QI contingency table.

The reconstruction probability of the sensitive attribute S for the target individual with quasi-identifier QI_r is given by:

Pr(Ŝ_r = S_r | QI_r) = Σ_{ν ∈ Ω_s} (p_{νu}^(s))^2 θ_u / (Σ_{k ∈ Ω_s} p_{νk}^(s) θ_k),   (8)

where θ_k = π_{QI_r, k} / π_{QI_r} denotes the within-group proportion of records with S = k.
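Since (7) and (8) share the same round-trip structure, one helper covers both in a sketch of Result 1. The helper names and the within-group distribution argument theta are ours, not the paper's, and P_qi may be applied via the Lemma 2 trick described below instead of being materialized.

import numpy as np

def reconstruction_prob(P, pi):
    # Round-trip probability Pr(X^ = X | X = alpha) for every alpha:
    # sum over beta of Pr(beta | alpha)^2 * pi_alpha / lambda_beta, as in (7).
    lam = P @ pi                        # expected randomized contingency table
    return pi * ((P ** 2).T @ (1.0 / lam))

def disclosure_risk(P_qi, pi_qi, P_s, theta, alpha, u):
    # Result 1 (RR-Both): product of the two reconstruction probabilities.
    # theta is the within-group distribution of S for the group QI_r = alpha;
    # alpha and u are the indices of QI_r and S_r in the fixed cell order.
    return reconstruction_prob(P_qi, pi_qi)[alpha] * reconstruction_prob(P_s, theta)[u]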
We are interested in when the attribute disclosure risk reaches its minimum. We show our results in the following property and include the proof in the Appendix.

PROPERTY 1. Given QI_r of individual r: for RR-S, Pr(S_r | QI_r) is minimized when p_s = 1/d_s; for RR-QI, Pr(S_r | QI_r) is minimized when p_i = 1/d_i for i = 1, ..., m; for RR-Both, Pr(S_r | QI_r) is minimized when p_s = 1/d_s and p_i = 1/d_i for all i.

Example. We use the instance shown in Table 1b to illustrate this property. For an individual r with (QI_r = Female, S_r = Cancer), we randomize S (Disease) with p_s and QI (Gender) with p_G independently. Fig. 2 shows how the attribute disclosure risk varies as we apply different randomization parameters. We can see that Pr(S_r | QI_r) reaches its maximum when no randomization is introduced (i.e., p_s = p_G = 1). Fig. 2a shows the scenario where we only randomize S. We can see that min Pr(S_r | QI_r) is attained when p_s = 1/d_s = 1/3.
Fig. 2b shows the scenario where we only randomize QI. We can see that min Pr(S_r | QI_r) = π_10 = 0.12 when p_G = 1/d_G = 1/2. Fig. 2c shows the case where randomization is applied to both QI and S; Pr(S_r | QI_r) reaches its minimum only when both p_s = 1/3 and p_G = 1/2.

Computational Cost. The main computational cost in (6) comes from calculating Pr(Q̂I_r = QI_r). Let P_QI be the distortion matrix on the quasi-identifiers, P_QI = P_1 ⊗ ··· ⊗ P_m; let π_QI be the contingency table on all quasi-identifiers arranged as a column vector; and let λ_QI denote the expected QI contingency table of the randomized data, λ_QI = P_QI π_QI. Then, the denominator in (7) is exactly the cell of λ_QI corresponding to β. Let η denote the column vector of the reconstruction probabilities of the quasi-identifiers, arranged in the same order as π_QI. We can further express (7) in matrix form:

η = π_QI × [(P_1^2 ⊗ ··· ⊗ P_m^2)^T λ_QI^{-1}],   (9)

where × denotes component-wise multiplication, P_i^2 is the component-wise square of P_i, and λ_QI^{-1} is the component-wise inverse of λ_QI.

In (9), we need to repeatedly calculate (P_1 ⊗ ··· ⊗ P_m) x, where x denotes a column vector. Assume we use the naive algorithm in all matrix multiplications. Calculating (P_1 ⊗ ··· ⊗ P_m) x directly results in a time and storage complexity of O(∏_i d_i^2); the main storage cost comes from storing the matrix P_QI. The following lemma allows us to reduce the cost of this computation:

LEMMA 2.
Let A, B, and X be matrices of size n × n, m × m, and m × n, respectively. Then

(A ⊗ B) vec(X) = vec(B X A^T),

where vec(X) denotes the vectorization of the matrix X formed by stacking the columns of X into a single column vector. Applying Lemma 2 recursively, we can reduce the time complexity to O((Σ_i d_i) ∏_i d_i). The storage complexity is also reduced to O(∏_i d_i + Σ_i d_i^2), which is mainly used to store the contingency table.
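A sketch of the resulting computation: contracting one Kronecker factor at a time via Lemma 2 avoids ever materializing P_QI. The tensor formulation below is equivalent to applying (A ⊗ B) vec(X) = vec(B X A^T) recursively.

import numpy as np
from functools import reduce

def kron_matvec(mats, x):
    # Compute (P_1 kron ... kron P_m) @ x one factor at a time.
    dims = [P.shape[0] for P in mats]
    t = x.reshape(dims)
    for k, P in enumerate(mats):
        t = np.tensordot(P, t, axes=([1], [k]))  # contract the k-th index
        t = np.moveaxis(t, 0, k)                 # restore the axis order
    return t.reshape(-1)

# Sanity check against the explicit Kronecker product on small matrices.
rng = np.random.default_rng(0)
mats = [rng.random((d, d)) for d in (2, 3, 4)]
x = rng.random(2 * 3 * 4)
assert np.allclose(kron_matvec(mats, x), reduce(np.kron, mats) @ x)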
> B. Maximizing Utility with Privacy Constraints
The ultimate goal of publishing data is to maximize utility while simultaneously minimizing the risk of attribute disclosure. The utility of any data set, whether randomized or not, is innately dependent on the tasks that one may perform on it. Without a workload context, it is difficult to say whether a data set is useful or not. This motivates us to focus on the data distribution when evaluating the utility of a database, since many data mining applications are based on the probability distribution of the data.

PROBLEM 1. Determine the randomization parameters p_1, ..., p_m and p_s that

minimize E[d(π̂, π)]  subject to  Pr(S_r | QI_r) ≤ τ for every record r,

where E[d(·)] denotes the expectation of d(·) and d(π̂, π) denotes a certain distance between π̂ and π. We can set the privacy threshold to match the privacy requirement of l-diversity (i.e., τ = 1/l) to compare the performance of different disguise schemes. That is, we determine the optimal randomization parameters (p_1, p_2, ..., p_m, and p_s) to maximize the utility, while ensuring that the sensitive value of any individual involved in the data set cannot be correctly inferred by an attacker with probability more than 1/l. A larger l leads to stronger privacy protection. In general, privacy constraints may be flexible; for example, different individuals may have different concerns about their privacy, so we can set different thresholds for Pr(S_r | QI_r).

Problem 1 is a nonlinear optimization problem. In general, we can solve it using optimization packages (e.g., the trust region algorithm [40]). In Section IV-A, we discussed how to efficiently calculate the attribute disclosure risk of target individuals (shown as the constraints in Problem 1). Next, we show how we can efficiently calculate E[||π̂ − π||_2^2], the expected squared Euclidean distance between the original distribution and the estimated one.

RESULT 2.
When d(π̂, π) is the squared Euclidean distance, Problem 1 is equivalent to: determine p_i, i = 1, ..., m, and p_s to minimize ∏_i ||P_i^{-1}||_F^2 (the product taken over all randomized attributes, including S) subject to the privacy constraints of Problem 1.

When d(π̂, π) is the squared Euclidean distance, by Lemma 3 shown in the Appendix, minimizing E[d(π̂, π)] is equivalent to minimizing trace(Σ), where Σ is the covariance matrix of the estimator of the cell values in the contingency table (shown in Equation 2). Calculating trace(Σ) still involves a high computational cost. However, when P_i has the specific form shown in (1), we can further reduce the problem to minimizing ∏_i ||P_i^{-1}||_F^2, and with Lemma 4 we have

P_i^{-1} = (1/(p_i − q_i)) (I − q_i 1 1^T),  so  ||P_i^{-1}||_F^2 = [d_i (1 − q_i)^2 + d_i (d_i − 1) q_i^2] / (p_i − q_i)^2,

where 1 is the column vector whose cells are all equal to 1.
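Because the objective above decreases in each p_i while the disclosure risk increases in p_i, the single-attribute optimum sits where the privacy constraint is tight. A minimal sketch, assuming a monotone callable risk(p) built from Result 1 (the function names are ours):

import numpy as np

def inv_frobenius_sq(p, d):
    # ||P_i^{-1}||_F^2 for the Equation-(1) matrix, closed form per Lemma 4.
    q = (1.0 - p) / (d - 1)
    return (d * (1 - q) ** 2 + d * (d - 1) * q ** 2) / (p - q) ** 2

def optimal_p(risk, tau, d, iters=60):
    # Largest p in (1/d, 1] with risk(p) <= tau, found by bisection; by the
    # monotonicity argument this p also minimizes the objective above.
    lo, hi = 1.0 / d, 1.0
    if risk(hi) <= tau:          # no randomization needed at all
        return hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if risk(mid) <= tau else (lo, mid)
    return lo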
V. EMPIRICAL EVALUATION

We ran our experiments on the Adult database from the UCI data mining repository [41]. The same database has been used in previous work on k-anonymity, l-diversity, t-closeness, and anatomy [2-5]. The Adult database contains 45,222 tuples from US census data and 14 attributes. Table 2 is a summary description of the data, including the attributes we used, the number of distinct values for each attribute, and the types of the attributes adopted in the evaluation.

It is expected that a good publication method should preserve both privacy and data utility. We set different l values as privacy disclosure thresholds. To quantify utility, we adopt the following standard distance measures to compare the distributions of the original and reconstructed data. Given two distributions P = (p_1, p_2, ..., p_m) and Q = (q_1, q_2, ..., q_m), the KL distance is defined as

d_KL(P, Q) = Σ_i p_i log(p_i / q_i),

and the χ²-distance is

d_χ²(P, Q) = Σ_i (p_i − q_i)^2 / p_i.

In our evaluation, we first investigate utility vs. privacy of the randomization method in protecting against attribute disclosure from two aspects: 1) data utility across the different randomization scenarios, and 2) the impact of cardinality on data utility. Then, we compare randomization with l-diversity and anatomy.
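For reference, a minimal sketch of the two distance measures as used here; how zero-probability cells are handled is an implementation choice that the text does not specify.

import numpy as np

def kl_distance(p, q):
    # d_KL(P, Q) = sum_i p_i log(p_i / q_i); cells with p_i = 0 contribute 0.
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def chi2_distance(p, q):
    # Chi-square distance: sum_i (p_i - q_i)^2 / p_i over cells with p_i > 0.
    m = p > 0
    return float(np.sum((p[m] - q[m]) ** 2 / p[m]))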
We treated Education, Marital-status, Gender, and Race as the quasi-identifier and used Workclass as the sensitive attribute. We term this data set EMGRW. For randomization, we distort only the QI attributes, only the sensitive attribute S, or both, in the different application scenarios. Table 3 shows the derived randomization parameters p for the three scenarios (RR-QI, RR-S, and RR-Both). We set l = 2, 3, 4, 5. We can observe that more distortion (p further away from 1) is needed to achieve stronger privacy protection (as l increases). We can also observe that p_i for the QI attributes in RR-Both is generally closer to 1 than in RR-QI, because RR-Both can also distort the sensitive attribute, in addition to the QI attributes, to achieve a given level of privacy protection. Thus, a smaller magnitude of distortion is needed for the QI attributes in RR-Both.

Fig. 3 shows the results on the various distance measures. Naturally, there is a tradeoff between minimizing utility loss and maximizing privacy protection. In Fig. 3, the smaller the distance values, the smaller the difference between the distorted and the original databases, and the better the utility of the distorted database. We can observe that the utility loss (in terms of distance differences) increases approximately linearly with increasing privacy protection across all three randomization scenarios.
RR-Both achieves the best utility preservation, because we use the optimal randomization parameters for both QI and the sensitive attribute.

As discussed in Section III, one advantage of the randomization scheme is that the more data we have, the more accurate a reconstruction we can achieve. To investigate the impact of data size on the utility of the randomized data, we generated four more data sets of varied sizes by randomly sampling r·N tuples from the Adult data, where N is the size of the original Adult data set and r ∈ {0.5, 1.5, 2, 2.5}. All four generated data sets have the same distribution as the original one. Fig. 4 shows the accuracy of the reconstructed data distribution as the data size increases. We see that data utility is further improved when more data are available.

> B. Comparison to Other Models
To compare the randomization scheme with l-diversity and anatomy, we chose Education, Salary, Gender, and Race as QI, and Occupation as the sensitive attribute, similar to the settings of the empirical evaluations in [2, 5]. We term this data set ESGRO. We did not use the previous EMGRW data set, because l-diverse partitions cannot be derived for it by the anatomy algorithm or entropy l-diversity. As specified in Machanavajjhala et al. [2] and Xiao and Tao [5], an l-diverse partition exists if and only if at most N/l records are associated with the same sensitive value, where N is the cardinality of the data set; the distribution of attribute Workclass is highly skewed, and the frequency of one value is much larger than the others. In this section, we focus on the use of RR-QI for the comparison with the two group based models, l-diversity and anatomy, because the overall distribution of sensitive attribute values is unchanged both by the group based methods after anonymization and by RR-QI after randomization. Generalization approaches usually measure utility syntactically, by the number of generalization steps applied to quasi-identifiers [42], the average size of quasi-identifier equivalence classes [2], the sum of squares of class sizes, or the preservation of marginals [3]. Since the randomization scheme is not based on generalization or suppression, we further examine utility preservation from the perspective of answering aggregate queries, in addition to the previous distribution distance measures. We adopt query answering accuracy, the same metric used in Zhang et al. [4]. We also consider the variation of correlations between the sensitive attribute and quasi-identifiers. We used an implementation of the Incognito algorithm [3] to generate the entropy
l-diverse tables and used the anatomy algorithm in [5] in our experiments.

1) Distribution Distance: We compare the data utility of the reconstructed probability distributions for the different models according to the distance measures. Fig. 5 shows the distances between π̂ and π for anatomy, l-diversity, and RR-QI on the data set ESGRO. We can observe that randomization outperforms both the anatomy and l-diversity methods, because we can partially recover the original data distribution in the randomization scheme, whereas the data distribution within each generalized equivalence class is lost in l-diversity generalization, and the relations between the quasi-identifier table (QIT) and the sensitive table (ST) are lost in anatomy.
Another observation is that data utility (in terms of the distance between the original and reconstructed distributions) decreases monotonically as the privacy threshold l increases for randomization and anatomy. This is expected: for randomization, more randomness must be introduced as the privacy requirement increases, and for anatomy, a larger l means that more tuples are included in each group, which decreases the accuracy of the estimate of the original data distribution. However, there is no similar monotonic trend for l-diversity, because the generalization algorithm chooses different attributes for generalization at different l values; this makes the accuracy of the estimated distribution vary to different extents.
2) Query Answering Accuracy: The accuracy of answering aggregate queries is one of the important aspects in evaluating the utility of distorted data. We compare randomization against the two other anonymization approaches [3, 5] using the average relative error in the returned count values. For each count value (corresponding to the number of records in a group), the relative error is defined as |act − est| / act, where act is the actual count value from the original data, and est is the estimate from the reconstructed data for the RR approaches or from the anonymized data for the anonymization approaches. We consider two types of queries in our evaluation.
- Base group-by query, with the form:

SELECT A1, ..., Am, S, COUNT(*) FROM data
WHERE A1 = i1 AND ... AND Am = im AND S = is
GROUP BY (A1, ..., Am, S)

where i_k ∈ Ω_k and i_s ∈ Ω_s.
- Cube query with the form:
SELECT A1,· · · , Am, S, COUNT(*) FROM data
GROUP BY CUBE (A1, · · · , Am, S)
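A minimal sketch of the relative-error metric defined above, averaged over the count values such queries return; we skip groups whose actual count is zero, an assumption the text does not spell out.

import numpy as np

def avg_relative_error(act, est):
    # Average |act - est| / act over all returned count values with act > 0.
    act, est = np.asarray(act, float), np.asarray(est, float)
    m = act > 0
    return float(np.mean(np.abs(act[m] - est[m]) / act[m]))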
The base group-by query returns the number of records in each group (A_1 = i_1, ..., A_m = i_m, and S = i_s). Note that the total number of groups is equal to D = d_s ∏_{i=1}^m d_i. After running the base group-by query on the original data and on the reconstructed or anonymized data, we get the count values act and est, respectively. Fig. 6a shows the calculated average relative errors of the three methods (anatomy,
l-diversity, and RR-QI) with varied l (l = 1, ..., 7). We use the CUBE query to describe all possible hierarchical aggregate queries. (In multidimensional jargon, a cube is a cross-tabulated summary of detail rows. CUBE enables a SELECT statement to calculate subtotals for all possible combinations of a group of dimensions; it also calculates a grand total.) The CUBE query returns aggregate values for all possible combinations of the QI attributes. We calculate the average relative error for each l = 1, ..., 7 and show the results in Fig. 6b. We can observe from Figs. 6a and 6b that randomization permits significantly more accurate aggregate analysis than both l-diversity and anatomy, since it recovers a more accurate data distribution. Conversely, l-diversity loses the data distribution of the records within each generalized QI-group, and anatomy loses the correlations between the QI attributes and the sensitive attribute S.

3) Correlation Between Attributes: A good publishing method should also preserve data correlations (especially between QI and sensitive attributes). We use the uncertainty coefficient (U) to evaluate the correlation between two multi-category variables:

U(X | Y) = [H(X) − H(X | Y)] / H(X),

where H(X) is the entropy of X and H(X | Y) is the conditional entropy of X given Y.
The uncertainty coefficient takes values between -1 and 1; larger values represent a stronger association between the variables. When the response variable has several possible categorizations, these measures tend to take smaller values as the number of categories increases.
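A sketch of the uncertainty coefficient computed from a two-way contingency table, using the standard definition U(X | Y) = I(X; Y) / H(X); the exact variant used in the paper may differ.

import numpy as np

def uncertainty_coefficient(counts):
    # counts: contingency table with X on the rows and Y on the columns.
    pxy = counts / counts.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    h_x = -np.sum(px[px > 0] * np.log(px[px > 0]))                 # H(X)
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))  # I(X; Y)
    return float(mi / h_x)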
Tables 4 and 5 show how the correlation (uncertainty coefficient) values between each pair of attributes vary under the three models (anatomy, l-diversity, and RR-QI) on the data set ESGRO. We vary l from 2 to 7, but only include the correlation values for l = 3, 4, and 5 due to space limitations. We use the attribute pair of Salary (S, one QI attribute) and Occupation (O, the sensitive attribute) as an example (the column in bold font in Table 5) to study how the correlations between QI and S change. The original uncertainty coefficient is 2.74 × 10^-2. RR-QI achieves good correlation preservation, i.e., 2.41, 2.27, and 2.17 (×10^-2) for l = 3, 4, and 5, respectively. Conversely, the uncertainty coefficient under anatomy is only 0.20, 0.12, and 0.05 (×10^-2), correspondingly. l-diversity achieves zero correlation preservation when l = 3 and 5, while it perfectly preserves the correlation when l = 4, because the Salary attribute is generalized to "All" when l = 3 and 5, while it is unchanged across all QI-groups when l = 4. In general, l-diversity cannot preserve correlations well, because it is intractable to predict which QI attributes will be generalized.

Fig. 7 shows the average value of the uncertainty coefficients among attributes for anatomy, l-diversity, and RR-Both on the data set ESGRO. As shown in Fig. 7a, randomization achieves the best preservation of the correlation between the sensitive attribute and the quasi-identifiers across all privacy thresholds. Fig. 7b also clearly shows that randomization preserves the correlations among quasi-identifiers better than l-diversity does. Note that anatomy best preserves the correlations among quasi-identifiers, since it does not change the values of quasi-identifiers.
In summary, the evaluation shows that randomization better preserves utility (in terms of distribution reconstruction, accuracy of aggregate query answering, and correlation among attributes) under the same privacy requirements; its utility loss is significantly smaller than that of the generalization and anatomy approaches. Furthermore, the randomization approach can further improve the accuracy of the reconstructed distribution (and hence utility preservation) when more data are available, whereas the generalization and anatomy approaches do not have this property.
VI. CONCLUSION AND FUTURE WORK

In this paper, we investigated attribute disclosure under linking attacks. We compared randomization to other anonymization schemes (l-diversity and anatomy) in terms of utility preservation under the same privacy protection requirements. Our experimental evaluations showed that randomization significantly outperforms generalization, i.e., it achieves better utility preservation while yielding the same privacy protection.
There are several other avenues for future work. We aim to extend our research to handle multiple sensitive attributes. In this paper, we limited our scope to attackers who have no knowledge about the sensitive attribute of specific individuals in the population and/or the table. In practice, this may not be true, since the attacker may have instance-level background knowledge (e.g., the attacker might know that the target individual does not have cancer, or might know complete information about some people in the table other than the target individual) or partial demographic background knowledge about the distribution of sensitive and insensitive attributes in the population. Different forms of background knowledge have been studied in privacy preservation data publishing recently. For example, formal languages [22, 24] and maximum estimation constraints [43] have been proposed to express the background knowledge of attackers. We will investigate privacy disclosure under those background knowledge attacks. We should note that randomization may not outperform generalization in all cases, especially when some specific background knowledge is utilized in the adversary model: generalization is truthful, and the ambiguity is between true tuples (e.g., individuals a and b are indistinguishable), whereas randomization can be regarded as adding noise, and noise can be removed when some background knowledge is known.
We will continue our study of further reducing the computational overhead of the randomization approach. The execution time of randomization is usually longer than that of l-diversity and anatomy. For example, the execution time of our RR-Both approach on the Adult data sets was 10 times longer than that of the generalization approaches, due to the heavy cost of determining the optimal distortion parameters under attribute disclosure constraints, especially when the attribute domains are large. One advantage of randomization is that the reconstruction accuracy increases when more data are available. Guo et al. [10] conducted a preliminary theoretical analysis of how the randomization process affects the accuracy of various measures (e.g., support, confidence, and lift) in market basket data analysis. In our future work, we will study the accuracy of reconstruction in terms of the bias and variance of the estimates when randomizing general microdata.
Finally, we are interested in comparing our randomization approaches to recently developed generalization approaches (e.g., the accuracy-constrained l-diversification method in [14]). We are also interested in combining the derived optimal randomization schemes with small domain randomization [38] to further improve utility preservation. It is our belief that extensive studies comparing different privacy preservation data publishing approaches are crucial.
> Appendix 1. Proof of Property 1
We start by applying RR to the sensitive attribute only, so that Pr(Q̂I_r = QI_r) = 1. Without loss of generality, we assume that S_r = 1. Combine the other categories 0, 2, ..., d_s − 1 into a new category, and still use 0 to denote the new category. To keep the notation simple, we write the within-group proportion of the new category simply as π_0 in this proof; then π_1 = 1 − π_0. This adjustment does not change the reconstruction probability Pr(Ŝ_r = S_r = 1 | QI_r). After the adjustment, the randomization probabilities are given by

Pr(S̃ = 1 | S = 1) = p_s,  Pr(S̃ = 0 | S = 1) = 1 − p_s,
Pr(S̃ = 1 | S = 0) = q_s,  Pr(S̃ = 0 | S = 0) = 1 − q_s,   (13)

where q_s = (1 − p_s)/(d_s − 1).
By definition, the posterior probabilities are given by

Pr(S = 1 | S̃ = 1) = p_s π_1 / (p_s π_1 + q_s π_0),   (14)
Pr(S = 1 | S̃ = 0) = (1 − p_s) π_1 / ((1 − p_s) π_1 + (1 − q_s) π_0).   (15)
Combining (13), (14), and (15), we have

Pr(Ŝ_r = S_r = 1 | QI_r) = p_s^2 π_1 / (p_s π_1 + q_s π_0) + (1 − p_s)^2 π_1 / ((1 − p_s) π_1 + (1 − q_s) π_0).   (16)
Taking the derivative with respect to p_s shows that (16) is minimized when p_s = 1/d_s, and the minimal value is π_1.
Following similar strategies, we can prove the general case when we randomize both QI and S.
> Appendix 2. Proof of Result 2
LEMMA 3. If d(π̂, π) = ||π̂ − π||_2^2, we have E[d(π̂, π)] = trace(Σ), where Σ is the covariance matrix of π̂ shown in (2).
PROOF. We know that π̂ asymptotically follows the normal distribution N(π, Σ). Let Σ = XΛX^T be the eigen-decomposition of Σ, where Λ = diag(λ_1, ..., λ_n) and X^T X = I. Let η = X^T (π̂ − π); then η is normally distributed with E(η) = 0 and

Cov(η) = X^T Cov(π̂ − π) X = X^T Σ X = Λ.

Notice that Λ is a diagonal matrix, and hence Cov(η_i, η_j) = 0 if i ≠ j, and Var(η_i) = λ_i, i.e., the η_i independently follow the normal distributions N(0, λ_i). Therefore, we have:

E[d(π̂, π)] = E[||π̂ − π||_2^2] = E[||X η||_2^2] = E[||η||_2^2] = Σ_i λ_i = trace(Λ) = trace(Σ).
The last equality is due to the fact that

trace(Σ) = trace(XΛX^T) = trace(ΛX^TX) = trace(Λ).

This proves the lemma.
LEMMA 4. Let P_i be the randomization matrix specified in (1). When p_i ≠ 1/d_i, we have

P_i^{-1} = (1/(p_i − q_i)) (I − q_i 1 1^T),  and hence  ||P_i^{-1}||_F^2 = [d_i (1 − q_i)^2 + d_i (d_i − 1) q_i^2] / (p_i − q_i)^2,

where 1 is the column vector whose cells all equal 1.

PROOF. We can re-write P_i as follows:

P_i = (p_i − q_i) I + q_i 1 1^T.

With p_i + (d_i − 1) q_i = 1 and the binomial inverse theorem [45], we immediately get P_i^{-1}, and ||P_i^{-1}||_F^2 can be directly calculated from P_i^{-1}.
LEMMA 5. Let (P^{-1})_i denote the i-th column (or row) of P^{-1}. Then, for any i = 1, 2, ..., D, we have

||(P^{-1})_i||_2^2 = ||P^{-1}||_F^2 / D.

PROOF. With Lemma 4, we observe that every row (or column) of P_i^{-1} has the same components, except for a change of order. Since P^{-1} = P_1^{-1} ⊗ ··· ⊗ P_m^{-1} ⊗ P_s^{-1}, we can also conclude that all rows (or columns) of P^{-1} have the same components, except for a change of order. Therefore, for all i's,

||(P^{-1})_i||_2^2 = (1/D) Σ_{j=1}^D ||(P^{-1})_j||_2^2 = ||P^{-1}||_F^2 / D.
Next, we prove the main result. With Lemma 3, when the distance is the squared Euclidean distance, minimizing E[d(π̂, π)] is equivalent to minimizing trace(Σ). With (2), we have

trace(Σ) = (1/N) [Σ_β λ_β ||(P^{-1})_β||_2^2 − ||π||_2^2] = (1/N) [||P^{-1}||_F^2 / D − ||π||_2^2],

where the second equality follows from Lemma 5 and Σ_β λ_β = 1. Since the Frobenius norm factorizes over the Kronecker product, ||P^{-1}||_F^2 = ∏_i ||P_i^{-1}||_F^2 (the product taken over all randomized attributes, including S), so minimizing trace(Σ) is equivalent to minimizing this product. Notice that the constraint function is an increasing function of the p_i, so the optimal solution must occur when the equality holds, and we have proved the result.
-
[Table 1.] 2 × 3 contingency tables for two variables: Gender (QI) and Disease (sensitive)
-
[Fig. 1.] Randomization based privacy-preserving data publishing.
-
[Fig. 2.] Pr(S_r | QI_r) vs. randomization parameters. QI: quasi-identifier.
-
[Table 2.] Description of the Adult data set used in the evaluation
-
[Table 3.] Randomization parameters p_i for three cases of RR (data set EMGRW)
-
[Fig. 3.] Distances between π̂ and π for three scenarios of RR (data set EMGRW). RR: randomized response; QI: quasi-identifier.
-
[Fig. 4.] For RR-Both, distances between π̂ and π decrease as the data set size increases (data set EMGRW). RR: randomized response.
-
[Fig. 5.] Distances between π̂ and π for anatomy, l-diversity, and RR-QI (randomized response, quasi-identifier) on the data set ESGRO.
-
[Fig. 6.] Relative errors of queries for anatomy, l-diversity, and RR-QI (randomized response, quasi-identifier) on the data set ESGRO.
-
[Table 4.] Variation of correlation (uncertainty coefficient) between pairs of quasi-identifier (QI) attributes under different models (×10^-2) (data set ESGRO)
-
[Table 5.] Variation of correlation (uncertainty coefficient) between quasi-identifier (QI) attributes and S under different models (×10^-2) (data set ESGRO)
-
[Fig. 7.] Average value of uncertainty coefficients among attributes for anatomy, l-diversity, and RR-QI (randomized response, quasi-identifier) on the data set ESGRO.