Limiting Attribute Disclosure in Randomization Based Microdata Release
 Author: Guo Ling, Ying Xiaowei, Wu Xintao
 Organization: Guo Ling; Ying Xiaowei; Wu Xintao
 Publish: Journal of Computing Science and Engineering Volume 5, Issue3, p169~182, 30 June 2011

ABSTRACT
Privacy preserving microdata publication has received wide attention. In this paper, we investigate the randomization approach and focus on attribute disclosure under linking attacks. We give efficient solutions to determine optimal distortion parameters, such that we can maximize utility preservation while still satisfying privacy requirements. We compare our randomization approach with ldiversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects (reconstructed distributions,accuracy of answering queries, and preservation of correlations). Our empirical results show that randomization incurs significantly smaller utility loss.

KEYWORD
Algorithms , Privacy preservation , Randomization , Attribute disclosure , Linking atack

I. INTRODUCTION
Privacy preserving microdata publication has received wide attention [15]. Each record corresponding to one individual has a number of attributes, which can be divided into the following three categories: 1)
identity attributes (e.g., Name and SSN)whose values can uniquely identify an individual; 2)quasiidentifier(QI) attributes (e.g., demographic attributes such as ZIP code, age, and gender) whose values when taken together can potentially identify an individual; 3)sensitive attributes(e.g., disease and income) that indicate confidential information of individuals.Before microdata are released, identity attributes are often directly removed to preserve privacy of individuals whose data are in the released table. However, the
QI information may be used by attackers to link to other public data sets to get the private information of individuals. This is recognized as linking attacks in microdata publishing. Two types of information disclosures have been identified under linking attacks: identity disclosure and attribute disclosure [6]. Identity disclosure occurs if attackers can identify an individual from the released data.Attribute disclosure occurs when confidential information about an individual is revealed and can be attributed to the individual.Samarati and Sweeney [1] proposed thek anonymity model and presented a generalization approach that divides tuples intoQI group by transforming theirQI values into less specific forms,such that tuples in the sameQI group cannot be uniquely identified by attackers to counter linking attacks based on quasiidentifiers.It was identified thatk anonymity is vulnerable to homogeneity and background knowledge attacks when data in theQI group lacks diversity in the sensitive attributes [2].ldiversity [2], as well as following models (e.g.,t closeness [7]),were proposed to protect attribute disclosure.l diversity requires that the sensitive attribute has at least one wellrepresented values for eachQI group in the generalized dataset.The randomization approach has also been adopted to publish microdata [812]. Instead of generalizing
QI attribute values,randomization approach distorts the original value to another domain value according to some distortion probabilities.The application of the randomization technique was studied to prevent identity disclosure under linking attacks in data publishing [12]. They focused on evaluating the risk of successfully linking a target individual to the index of his record given values ofQI attributes. Our research moves one step further to investigate attribute disclosure under linking attacks. We focus on evaluating the risk of successfully predicting the sensitive attribute value of a target individual given hisQI attribute values.We present a general randomization framework and give efficient solutions to determine optimal randomization parameters for bothQI and sensitive attributes. Thus, we can maximize data utility, while satisfying privacy preservation requirements for sensitive attributes. We compare our randomization approach with other anonymization approaches within the framework(e.g., two representative approaches ldiversity [2] and anatomy[5] are used in this paper). Our result shows the randomization approach can better recover the distribution of the original data set from the disguised one. Thus, intuitively, it might yield a disguised database with higher data utility thanl diversity generalization and anatomy.Our contributions are summarized as follows. We present a systematic study of the randomization method in preventing attribute disclosure under linking attacks. We propose the use of a specific randomization model and present an efficient solution to derive distortion parameters to satisfy requirements for privacy preservation, while maximizing data utilities. We propose a general framework and present a uniform definition for attribute disclosure which is compatible for both randomization and generalization models. We compare our randomization approach with
l diversity and anatomy in terms of utility preservation(under the same privacy requirements) from three aspects(reconstructed distributions, accuracy of answering queries, and preservation of correlations). Our empirical results show that randomization incurs significantly smaller utility loss.The remainder of this paper is organized as follows. In Section II, we discuss closely related work on group based anonymization approaches and randomization approaches in privacy preservation data publishing. In Section III, we present background on randomization based distortions, including analysis and attacks on the randomized data. In Section IV, we first quantify attribute disclosure under linking attacks and then show our theoretical results on maximizing utility with privacy constraints. In Section V, we conduct empirical evaluations and compare the randomization based distortion with two representative group based anonymization approaches (
l diversity [2]and anatomy [5]). We conclude our work in Section VI.II. RELATED WORK
> A. Group Based Anonymization
k anonymity was proposed [1] to counter linking attacks for data publishing. The method generalizes the values of quasiidentifier attributes to lessspecific ones, so that each individual cannot be distinguished from at leastk  1 other individuals based on quasiidentifier information. There has been much study in designing efficient algorithms fork anonymity using generalization and suppression techniques [2, 3, 5, 13, 14].k anonymity provides protection against identity disclosure,while it contributes little to attribute disclosure.l diversity [2] is proposed to counter attribute disclosure. A table isl diverse if,in eachQI group, at most 1/l of the tuples possesses the most frequent sensitive value. Thus, an individual can be linked to his sensitive value correctly with probability at most 1/l . Similarly,the notation oft closeness [7] requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a thresholdt ). The anatomy method was proposed to protect attribute disclosure by breaking the link between quasiidentifiers and sensitive attributes [5]. Anatomy releases all the quasiidentifier and sensitive values directly in two separate tables: aquasiidentifier table (QIT) and a sensitive table (ST). They presented an algorithm to compute anatomized tables that satisfy thel diversity requirement, while minimizing the error of reconstructing the data.Kifer and Gehrke [15] investigated the problem of injecting additional information into kanonymous and
l diverse tables to improve the utility of published data. Koudas et al. [16] studied the problem of preserving privacy through sensitive attribute permutation and generalization with the addition of fake sensitive attribute values. Narayanan and Shmatikov [17] presented a deanonymization methodology for sparse multidimensional microdata, such as individual transactions, preferences, and so on. The deanonymization algorithm could be used by attackers against data sets containing anonymous highdimensional data.Brickell and Shmatikov [18] compared the privacy loss (defined by certain kinds of information learned by attackers) with the utility gain (defined as the same kinds of information learned by analysts) caused by data anonymization and concluded that“even modest privacy gains require almost complete destruction of the data mining utility” in generalization methods. Li and Li[19] proposed the use of theworstcase privacy loss rather than theaverage privacy loss among all individuals adopted in [18]to measure privacy loss. They further presented a methodology to evaluate the privacyutility tradeoff, where privacy loss is quantified by the adversary’s knowledge gain about the sensitive values of specific individuals with the triviallyanonymized data as the baseline and utility loss is measured by the information loss about the sensitive values of large populations with the original data as the baseline. In our paper, we use 1/l to bound the attribute disclosure for all records and use the difference of distributions between the original data and the reconstructed data to quantify utility loss. We also examine the utility loss from the perspective of answering aggregate queries and preserving correlations between the sensitive attribute andQI attributes.In this paper, we focus on the randomization model in resisting attribute disclosure of data publishing. The proposed randomization model is compared to traditional
l diversity [2] and anatomy [5] models. Recently, Ghinita et al. [14] proposed a framework to solve the privacyconstrained and accuracy constrained data anonymization problems under thek anonymity andl diversity models. They developed efficient heuristics to solve the onedimensional problems in linear time and generalized solutions to multidimensional attributes using spacemapping techniques. Their empirical evaluation results showed that privacy/accuracyconstrained methods outperform existing generalization approaches in terms of the execution time and information loss. We would also point out that researchers have developed various anonymity models, includingp sensitivek anonymity[20], (α, k )anonymity [21], (k, e )anonymity [4], (c,k )safety [22],m confidentiality [23], privacy skyline [24],δ presence [25], (ε, m )anonymity [26], and (t, λ )uniqueness[27]. Detailed comparisons to these works are beyond the scope of this paper.> B. Randomization
The first randomization model, called Randomized Response(RR), was proposed by Warner [28] in 1965. The model deals with one dichotomous attribute, i.e., every person in the population belongs either to a sensitive group
A , or to its complement？ . The problem is to estimateπ_{A} , the unknown proportion of population members in groupA . Each respondent is provided with a randomization device by which the respondent chooses one of the following two questions Do you belong toA or Do you belong to？ with respective probabilitiesp and 1 p and then repliesyes orno to the question chosen. The technique provides response confidentiality and increases respondents’ willingness to answer sensitive questions, because no one but the respondent knows to which question the answer pertains. It has been extended to the field of privacy preserving data mining[29].Aggarwal and Yu [30] provided a survey of randomization models for privacy preservation data mining, including various reconstruction methods for randomization, adversarial attacks,optimality and utility of randomization. Rizvi and Haritsa [9]developed the MASK technique to preserve privacy for frequent item set mining. Agrawal and Haritsa [31] presented a general framework of random perturbation in privacy preservation data mining. Du and Zhan [8] studied the use of a randomized response technique to build decision tree classifiers. Guo et al. [10] investigated data utility in terms of the accuracy of reconstructed measures in privacy preservation market basket data analysis. Aggarwal [32] proposed adding noise from a specified distribution to produce a probabilistic model of kanonymity.Zhu and Liu [33] investigated the construction of optimal randomization schemes for privacy preservation density estimation and proposed a general framework for randomization using mixture models. RebolloMonedero et al. [34]defined the privacy measure similar to tcloseness [7] and formalized the privacyutility tradeoff from the information theory perspective. The resulting solution turns out to be the postrandomization(PRAM) method [35] in the discrete case and a form of random noise addition in the general case. They also proved that the optimal perturbation is in general randomized,rather than deterministic in the continuous case.
Huang and Du [36] studied the search of optimal distortion parameters to balance privacy and utility. They developed an evolutionary multiobjective optimization method to find optimal distortion matrices when applying the RR technique on a single attribute. The complexity of the algorithm is
O (((N_{Q} +N_{V} )^{3} +n^{2} )L ) whereN_{Q} is the population size, NV is the archive size,n is the number of columns in the RR matrix, andL is the maximum number of iterations. However, there is no guarantee that the derived matrices are optimal in the entire search space and the performance tends to be sacrificed for multiple attributes due to the intractable size of the search space. Similarly,Xiao et al. [37] investigated the optimal random perturbation at multiple privacy levels.In our paper, we focus on a specific randomization model(i.e., perturbation retains the original value of an attribute
A_{i} with probabilityp_{i} and replaces the original value with a random value from the domain ofA_{i} , see Equation 1) and derive an efficient solution to determine the optimal randomization parameters under linking attacks. Chaytor and Wang [38] also applied the use of this simple randomization model to randomize a sensitive value only within a subset of the entire domain.The developed algorithm, calledSmall Domain Randomization ,can effectively derive optimal partitions, such that randomizing the sensitive attribute values of records (with small domains) in each partition separately can retain more data utility with the same level of privacy than randomizing the whole data set over the entire domain. Our work focuses on attribute disclosure when applying full domain randomization on bothQI attributes and sensitive attributeS . The derived optimal randomization parameter values under privacy constraints can be incorporated in the small domain randomization.III. PRELIMINARIES
Dataset
T containsN records andm + 1 categorical attributes:A_{1} ,A_{2} , . . . ,A_{m} , andS . We useQI = {A_{1} , . . . ,A_{m} } to denote the set of quasiidentifier attributes (e.g., demographic) whose values may be known to the attacker for a given individual and useS to denote one sensitive attribute whose value should not be associated with an individual by attackers. Generally,T may also contain other attributes that are neither sensitive nor quasiidentifying.Those attributes are usually kept unchanged in the released data. We exclude them from our setting, since they do not incur privacy disclosure risk or utility loss. All of the discussions in this paper are also explained in the single sensitive attribute setting and can be generalized to multiple sensitive attributes.Attribute
A_{i} hasd_{i} categories denoted by 0,1,. . . ,d_{i}  1. We use Ω_{i} to denote the domain ofA_{i} (S ), Ω_{i} = {0, 1, . . . ,d_{i}  1},and Ω_{QI} = Ω_{1} × · · · × Ω_{m} is the domain of quasiidentifiers. Similarly,attributeS hasd_{s} categories denoted by 0,1,. . . ,d_{s}  1. We use Ω_{s} = {0, 1, . . . ,d_{s}  1} to denote the domain ofS . Ther th recordR_{r} is denoted by (A_{1r} ,A_{2r} , . . . ,A_{mr} ,S_{r} ) or simply (Q_{Ir} ,S_{r} ).LetD = d_{s} Π^{m}_{i1}d_{i} be the total number of cells in the contingency table.Let π_{i1, … , i m, is}denote the true proportion corresponding to the categorical combination (A_{1} = i_{1}, · · · , A_{m} = i_{m}, S = i_{s}). Let π_{i1}be the column vector with
D elements π_{il}, · · ·, _{im}, _{is} arranged in a fixed order. Table 1 shows one contingency table example for a data set with oneQI attribute (Gender, d_{1} = 2) and one sensitive attribute S (Disease, d_{s} = 3). Table 1a shows the original contingency table where π = (π_{00}, π_{01}, π_{02}, π_{10}, π_{11}, π_{12})' corresponds to a fixed order of cell entries π_{ij} in the 2 × 3 contingency table. π_{10} denotes the proportion of records with Female andCancer . The column sum π_{+0} represents the proportion of records withCancer across both genders. Note that contingency tables are widely used in statistics to record and analyze the relationship betweencategorical attributes. Results of most data mining tasks (e.g.,clustering, decision tree learning, and association rule mining),as well as aggregate queries, are solely determined by the contingency table. That is, analysis on contingency tables is equivalent to analysis on the original categorical data. Table 1b shows one contingency table instance derived from the data set with 100 tuples. We will use this contingency table as an example to illustrate properties of link disclosure.
> A. Distortion Procedure
We use the left part of Fig. 1 to illustrate the process of privacy preserving data publishing. For each record
R_{j} , the data owner independently randomizes attributeA_{i} using the distortion matrixP_{i} . Specifically, for attributeA_{i} (orS ) withd_{i} categories,the randomization process is to change a record belonging to the vth category (v = 0, . . . ,d_{i}  1) to the uth category with probabilityp_{uv}^{(i)} : Pr(A_{i} =u 'A_{i} =v ) =p_{uv}^{(i)} . Letp_{i} = [p_{uv}^{(i)} ]_{di×di} , and we callP_{i} (orP_{s} ) the distortion matrix forA_{i} (or S). Naturally,the column sums ofP_{i} are equal to 1. The original databaseT is changed toT , and then both the randomized data set and the distortion matrices are published. The randomization matrices indicate the magnitude of the randomization; this can help data analysts estimate the original data distribution.Let λ denote the contingency table of the randomized data
T . We arrange λ into the column vector with the same order of π.. Table 1c shows one example of the randomized contingency table. The randomized contingency table has a close relationship to the original contingency table and the randomization matrices: E(λ) = P π where P = P_{1} ？ · · · ？ P_{m} ？ P_{s}, and ？stands for the Kronecker product. The Kronecker product is an operation on two matrices, anm byn matrixA and ap byq matrixB , resulting in the mpbynq block matrix.Distortion matrices determine the privacy and utility of the randomized data. How to find optimal distortion parameters with privacy or utility constraints has remained a challenging problem [39]. In this paper, we limit the choice of randomization parameters for each
QI attributeA_{i} (and sensitive attribute S) as:That is, for each attribute
A_{i} , all categories have the same probabilityp_{i} to remain unchanged, and have the same probabilityq_{i} to be distorted to a different category. With this choice limit, we can derive an efficient algorithm (with explicit formula)to determine the optimal randomization parameters (as shown in Section IVB).> B. Analysis on the Randomized Data
One advantage of randomization procedure is that the pure random process allows data analysts to estimate the original data distribution based on the released data and the randomization parameters. The right part of Fig. 1 shows how data analysts estimate the original data distribution. With the randomized data
T and its contingency table λ, the unbiased estimate of π is given by π = P^{1}λ and the covariance matrix of the estimator iswhere (Pπ)^{δ} stands for the diagonal matrix, whose diagonal values are Pπ.
Thus, data analysts derive the estimation for the distribution of original data (in terms of contingency table) without disclosing the individual information of each record. We choose the accuracy of reconstructed distribution as the target utility function in Section IVB, because most data mining applications are based on the probability distribution of the data.
> C. Attacks on Randomized Data
Let
X be an attribute or a subset of attributes in the data setT with domain Ω_{X}, and X is the randomized value of X inT . It is not reasonable for attackers to regard the observed value as the true value of X, with the randomization process and parameters.Instead, attackers can try to estimate the original value based on the observed data and the released randomization parameters.Let X denote the attackers’ estimation on the original value ofX . Any value in Ω_{X} is possible due to the randomization procedure.We assume that the attacker is able to calculate the posterior probability of content in the data set and takes the following probabilistic strategy: for anyμ ,ν ∈ Ω_{X},where Pr(X =
μ ' X=ν ) denotes the attacker’s posterior belief on the original valueX =μ when he observes X =ν . With the Bayes’ theorem, it can be calculated by:The following lemma gives the accuracy of the attacker’s estimation.
LEMMA 1.
Suppose attackers adopt the probabilistic strategy specified in (3)to estimate the data. The probability that attackers accurately estimate the original value of X is given by :PROOF. For a particular observed value X =
ν ∈ Ω_{X},Pr(X =X =μ , X = ν) = Pr( = ν'X = μ) Pr(X = μ' =ν ). Then, with the law of total probability, we haveThe probability of attackers’ correct estimation is also defined as the reconstruction probability, Pr(
X =μ →X =μ ), in [9, 12].It captures the roundtrip of going from the original content to the distorted one and then returning to estimate the value of the original content. That is, it indicates how much information of the original data is preserved after randomization. To make the notation concise, we adopt the notation Pr(X=X =μ ) to denote the reconstruction probability to evaluate the risk of the sensitive attribute disclosure.IV. ATTRIBUTE DISCLOSURE UNDER LINKING ATTACKS
We measure privacy in terms of the attribute disclosure risk,whose formal definition is given, as follows:
DEFINITION 1.
The attribute disclosure risk under linking attacks is defined to be the probability that the attacker predicts Sr successfully given QI_{r} of a target individual r, denoted as Pr(S_{r}'QI_{r}). We need to quantify the background knowledge of attackers to derive the disclosure probability Pr(
S_{r} 'QI_{r} ). We have the following standard assumptions for background knowledge of attackers in this paper. We assume that the attacker has access to the published data set T and he knows that T is a randomized version of some base tableT . The attacker knows the domain of each attribute ofT . We also assume that the attacker can obtain theQI values of the target individual (e.g., Alice) from some public database or background knowledge and knows that the target individual is definitely contained in the published data.However, he has no knowledge of which record in the published data belongs to the target individual. Finally, we assume that the distortion matricesP_{i} are available to the attacker, because they are necessary for data miners to conduct analysis. The algorithm ofl diversity in [2] preserves privacy by generalizing theQI attributes to formQI groups. Individuals in the group are linked to any sensitive attributes with probability at most 1/l , i.e.,Pr(S_{r} 'QI_{r} ) ≤ 1/l . However, the randomization based approach achieves the privacy protection probabilistically. In the following subsection, we show how to quantify the attribute disclosure risk in the randomization settings.> A. Quantifying Attribute Disclosure
When there is no randomization applied, for those records with their quasiidentifiers equal to
QI_{r} , the attacker simply regards every record as having the same probability to represent individualr . The risk of sensitive attribute disclosure is equal tobecause there are π_{QIr} records within the group of
QI_{r} , and π_{QIr, Sr} of them have the sensitive value equal toS_{r} . This case corresponds to the worst case of the attribute disclosure risk. When randomization is applied, the attribute disclosure risk will be reduced, because the randomization increases the attacker’s uncertainty.1) Randomize S only (RRS): When data owners only apply randomization to the sensitive attribute, for each record within the group of QIr, the attacker takes a guess on its sensitive value using the observed sensitive value and the posterior probability in (4). According to (5), the probability of a correct estimation is Pr( = ？r'QIr), then the risk of sensitive attribute disclosure is
2) Randomize QI only (RRQI): Similarly as RRS, when data owners only apply randomization to the quasiidentifiers,the probability of correctly reconstructing QIr is given by Pr(Xr= QIr), and hence the risk of sensitive attribute disclosure is
3) Randomize QI and S (RRBoth): When data owners apply randomization to both QI and S, the attacker first needs to ensure the values of identifier attributes are correctly reconstructed.The probability is given by Pr(QIr = QIr). Second, the attacker needs to ensure the value of the sensitive attribute,given the correctly reconstructed identifier attribute values are correctly reconstructed. We summarize the risk of sensitive attribute disclosure in RRBoth , as well as RRS and RRQI, in the following result and give the general calculation of the attribute disclosure risk in the randomization settings.
RESULT 1. Assume an individual r has quasiidentifier QI_{r} =α = {i_{1}, i_{2}, . . . , i_{m}}, i_{k} ∈ Ω_{k}, and his sensitive attribute S_{r} = u, u∈ Ω_{s}. The probability of successfully predicting the sensitive attribute value S_{r} of the target individual r given his quasiidentifier values QI_{r} is: We give the formal expressions of the two reconstruction probabilities needed in calculating Pr(
S_{r} 'QI_{r} ) in (6). The reconstruction probability of the quasiidentifier is given by:where Pr(QI^{~}_{r} =
β 'QI_{r} =α ) =is the probability that α ={
i_{1} ,i_{2} , . . . ,i_{m} } is distorted toβ = {j_{1} ,j_{2} , . . . ,j_{m} }.The reconstruction probability of the sensitive attribute
S for the target individual with the quasiidentifierQI_{r} is given by:We are interested in when the attribute disclosure reaches the minimum. We show our results in the following property and include our proof in the Appendix.
PROPERTY 1. Given QI_{r} of individual r: for RRS, Pr(S_{r}'QI_{r}) is minimized when for RRQI, Pr(S_{r}'QI_{r}) is minimized when for RR
Both , Pr(S_{r}'QI_{r}) is minimized when Example. We use the instance shown in Table 1b to illustrate this property. For an individual r with (
QI_{r} = Female, S_{r} = Cancer ),we randomize S (Disease) withp_{s} andQI (Gender) withp_{G} independently. Fig. 2 shows how the attribute disclosure is varied when we apply different randomization parameters. We can see that Pr(S_{r} 'QI_{r} ) reaches the maximum (i.e.,when no randomization is introduced. Fig. 2a shows the scenario when
we only randomize S. We can see that min
when P_{s} = 1/d_{s} = 1/3_{s}
Fig. 2b shows the scenario when we only randomize
QI . We can see that min Pr(S_{r} 'QI_{r} ) = π_{10} = 0.12 when p_{G} = 1/d_{c} = 1/2Fig. 2c shows the case where randomization is applied to both
QI and S. Pr(S_{r} 'QI_{r} ) reaches the minimum only when bothp_{s} =1/3 and p_{G} = 1/2 and minComputational Cost. The main computation cost in (6)comes from calculating Pr(QI _{r} =QI_{r} ). LetP_{QI} be the distortion matrix on quasiidentifiers:P_{QI } = P_{1} ？ · · · ？ P_{m}, π_{QI} be the contingency table on all quasiidentifiers arranged in a column vector,and λ_{QI} denote the expectedQI contingency table of the randomized data: λ_{QI} =P_{QI} π_{QI} . Then, the denominator in (7) is exactly the cell ofλ_{QI} corresponding toβ . Letη denote the column vector of the reconstruction probabilities of quasiidentifiers,arranged in the same order ofπ_{QI} . We can further express(7) in matrix form:where × denotes the componentwise multiplication, P
_{i} ^{2} is the componentwise square of P_{i} , andλ_{QI}^{1} is the componentwise inverse ofλ_{QI} .In (9), we need to repeatedly calculate (P_{1} ？ · · · ？ P_{m})
,wherex denotes a column vector. Assume we use the naive algorithm in all matrix multiplications. Calculating (P_{1} ？ · · · ？ P_{m})x directly results in the time and storage complexity ofx O (π_{i} d_{i}^{2} ). The main storage complexity is from storing matrixP_{QI} . The following lemma allows us to reduce the cost of such computation:LEMMA 2.
Let A, B and X be the matrices of size n × n, m × m,and m × n. Then (
A ？ B ) vec(X ) = vec(BXA^{T} ),where vec(X) denotes the vectorization of the matrix X formed by stacking the columns of X into a single column vector. Applying Lemma 2 recursively, we can reduce the time complexity to
O([Σ_{i}d_{i}] π_{i}d_{i}) . The storage complexity is also reduced toO(π_{i}d_{i} + Σ_{i}d_{i}^{2}) ; this is mainly used to store thecontingency table.
> B. Maximizing Utility with Privacy Constraints
The ultimate goal of publishing data is to maximize utility,while the minimizing risk of attribute disclosure simultaneously.The utility of any data set, whether randomized or not,is innately dependent on the tasks that one may perform on it.Without a workload context, it is difficult to say whether a data set is useful or not. This motivates us to focus on the data distribution when evaluating the utility of a database, since many data mining applications are based on the probability distribution of the data.
where E[d(.)] denotes the expectation of d(.) and d(X?, π)denotes a certain distance between X and π. We can set the privacy threshold formalize as the same privacy requirement of ldiversity (i.e., τ = 1/
l ) to compare the performance of different disguised schemes. That is, we would determine the optimal randomization parameters (p_{1}, p_{2}, . . . , p_{m},and p_{s} ) to maximize the utility, while ensuring that the sensitive value of any individual involved in the data set cannot be correctly inferred by an attacker with probability more than 1/l . A larger l leads to stronger privacy protection. In general, privacy constraints may be flexible. For example, different individuals may have different concerns about their privacy, so we can set different thresholds for Pr(S_{r} 'QI_{r} ).Problem 1 is a nonlinear optimization problem. In general,we can solve it using optimization packages (e.g., trust region algorithm [40]). In Section IVA, we discussed how to efficiently calculate the attribute disclosure of target individuals(shown as constraints in PROBLEM 1). Next, we show how we can efficiently calculate
E [∥Xπ_{2}^{2}∥], the expected Euclidean distance difference between the original data and the estimated one.RESULT 2.
When d(X, π) is the squared Euclidean distance,Problem 1 is equivalent to: determine p_{i}, i = 1, · · · ,m, and p_{s} to squared Euclidean distance, with Lemma (3) shown in the Appendix, to minimize d(X, π) is equivalent to minimize trace(Σ) where Σ is the covariance matrix of the estimator of cell values in the contingency table (shown in Equation 2). Calculating trace(Σ) still involves high computational cost. However,when
P_{i} has the specific form shown in (1), we can further reduce the problem to minimizing π_{i} ┃P_{i}^{1}∥_{F}^{2}, and with Lemma 4,we havewhere 1 is the column vector whose cells are all equal to 1.
V. EMPIRICAL EVALUATION
We ran our experiments on the Adult Database from the UCI data mining repository [41] in our evaluations. The same database has been used in previous work on
k anonymity,l diversity,t closeness, and anatomy [25]. The Adult Database contains 45,222 tuples from US census data and 14 attributes. Table 2 is a summary description of the data including the attributes we used, the number of distinct values for each attribute, and the types of the attributes adopted in the evaluation.It is expected that a good publication method should preserve both privacy and data utility. We set different l values as privacy disclosure thresholds. We adopt the following standard distance measures to compare the difference of distributions between the original and reconstructed data to quantify the utility. Given two distributions
P = (p_{1}, p_{2}, ..., p_{m} ),Q = (q_{1}, q_{2}, ..., q_{m} ),KL distance is defined asand
χ^{2} distance is d_{x2}(P, Q )=In our evaluation, we first investigate utility vs. privacy of the randomization method in protecting attribute disclosure on two aspects: 1) compare data utility among different scenarios,and 2) the impact of cardinality upon data utility. Then, we compare randomization with
l diversity and anatomy.> A. Randomization
We treated
Education, Martialstatus, Gender , andRace as the quasiidentifier and usedWorkclass as the sensitive attribute. We term this data set EMGRW. We distort onlyQI attributes, or sensitive attributeS , or both in different application scenarios for randomization. Table 3 shows the derived randomization parameter p for three scenarios (RRQI ,RRS , andRRBoth ). We setl = 2, 3, 4, 5. We can observe that more distortions(p is away from 1) are needed to achieve better privacy protections (whenl is increased). We can also observe thatp_{i} forQI attributes inRRBoth is generally closer to 1 than that inRRQI ,because RRBoth could also distort the sensitive attribute in addition to thoseQI attributes to achieve some given privacy protection. Thus, a small magnitude of distortion is needed forQI attributes inRRBoth .Fig. 3 shows results on various distance measures. Naturally,there is a tradeoff between minimizing utility loss and maximizing privacy protection. Fig. 3 indicates that the smaller the distance values, the smaller the difference between the distorted and the original databases, and the better the utility of the distorted database. We can observe that the utility loss (in terms of distance differences) increases approximately linearly with the increasing privacy protection across all three randomization scenarios.
RRBoth achieves the best in terms of utility preservation,because we use the optimal randomization parameters for bothQI and the sensitive attribute.As discussed in Section III, one advantage of the randomization scheme is that the more data we have, the more accurate reconstruction we can achieve. We generate four more data sets with varied sizes by sampling
r * N tuples from the Adult Data randomly where N is the size of the original Adult data set and we set r ∈ [0.5, 1.5, 2, 2.5] to investigate the impact of data size upon the data utility of the randomized data. All four generated data sets have the exact same distribution as the original one.Fig. 4 shows the accuracy of reconstructed data distribution when data size increases. We see that data utility is further improved when more data are available.> B. Comparison to Other Models
We chose
Education, Salary, Gender, Race asQI , andOccupation as the sensitive attribute, similar to the settings of empirical evaluations in [2, 5] to compare randomization scheme withl diversity and anatomy. We term this data set ESGRO. We did not use the previous EMGRW data set, becausel diverse partitions cannot be derived by the anatomy algorithm or entropy ldiversity.As specified in Machanavajjhala et al. [2] and Xiao and Tao [5], anl diverse partition exists, if and only if at mostN/l records are associated with the same sensitive value, whereN is the cardinality of the data set. The data distribution of attributeWorkclass is much skewed and the frequency of one value is much larger than the others. In this section, we focus on the use ofRRQI to compare to those two group based models:l diversity and anatomy, because the overall distribution of sensitive attribute values is unchanged in both group based methods after anonymization andRRQI after randomization.Generalization approaches usually measure the utility syntactically by the number of generalization steps applied to quasiidentifiers [42], average size of quasiidentifier equivalence classes [2], sum of squares of class sizes, or preservation of marginals [3]. We further examine the utility preservation from the perspective of answering aggregate queries, in addition to the previous distribution distance measures, since the randomization scheme is not based on generalization or suppression.We adopt query answering accuracy; the same metric is used in Zhang et al. [4]. We also consider the variation of correlations between the sensitive attribute and quasiidentifiers. We used an implementation of the Incognito algorithm [3] to generate the entropy
l diverse tables and used the anatomy algorithm in [5] in our experiments.1) Distribution Distance: We compare data utility of reconstructed probability distributions for different models according to the distance measures. Fig. 5 shows distances between X and π for anatomy, ldiversity and RRQI on the data set ESGRO.We can observe that randomization outperforms both anatomy and ldiversity methods, because we can partially recover the original data distribution in the randomization scheme, whereas data distribution within each generalized equivalence class is lost in ldiversity generalization and the relations between the quasiidentifier table (QIT) and the sensitive table (ST) are also lost in anatomy.
Another observation is that data utility (in terms of distance between original and reconstructed distributions) monotonically decreases with the increment of the privacy thresholds (l) for randomization and anatomy. This is naturally expected, because more randomness needs to be introduced with the increment of the privacy requirements for randomization and larger l in anatomy means that more tuples are included in each group, which decreases the accuracy of the estimate for the distribution of the original data. However, there is no similar monotonic trend in ldiversity,because the generalization algorithm chooses different attributes for generalization with various l values; this makes the accuracy of the estimated distribution vary to different extents.
2) Query Answering Accuracy:The accuracy of answering aggregate queries is one of the important aspects to evaluate the utility of distorted data. We compare randomization against two other anonymization approaches [3, 5] using the average relative error in the returned count values. For each count value(corresponding to the number of records in a group), its relative error is defined as act  est/act, where act is the actual count value from the original data, and est is the estimate from the reconstructed data for RR approaches or the anonymized data for anonymization approaches. We consider two types of queries in our evaluation.
 Base groupby query with the form:SELECT A1,. . . , Am, S, COUNT(*) FROM data
WHERE A1 = i1 . . . Am = im AND S = is GROUP BY (A1, · · · ,
Am, S)
Where ik ∈ Ωk and is ∈ Ωs.
 Cube query with the form:
SELECT A1,· · · , Am, S, COUNT(*) FROM data
GROUP BY CUBE (A1, · · · , Am, S)
The base groupby query returns the number of records in each group of (
A_{1} = i_{1} . . .A_{m} = i_{m} andS = i_{s} ). Note that the total number of groups is equal D = d_{s} Π_{i1}^{m} d_{i}).After running the base groupby query on the original data and the reconstructed data or the anonymized data, we get the count values of act and est, respectively. Fig. 6a shows the calculated average relative errors of three methods (anatomy,
l diversity,andRRQI ) with varied l (l = 1, · · · , 7). We use the CUBE query to describe all the possible hierarchical aggregate queries (In multidimensional jargon, a cube is a crosstabulated summary of detail rows. CUBE enables a SELECT statement to calculate subtotals for all possible combinations of a group of dimensions. It also calculates a grand total.). The CUBE query returns aggregate values of all possible combinations ofQI attributes. We calculate the average relative error for eachl = 1,· · · , 7 and show the results in Fig. 6b. We can observe from Fig. 6a and 6b that randomization permits significantly more accurate aggregate analysis than bothl diversity and anatomy, since it can recover more accurate data distribution. Conversely,l diversity loses the data distribution of records within each generalizedQI group and anatomy loses correlations between theQI attributes and the sensitive attributeS .3) Correlation Between Attributes: A good publishing method should also preserve data correlation (especially between QI and sensitive attributes). We use Uncertainty Coefficient(U) to evaluate the correlation between two multicategory variables:
The uncertainty coefficient takes values between 1 and 1;larger values represent a strong association between variables.When the response variable has several possible categorizations,these measures tend to take smaller values, as the number of categories increases.
Table. 4 and 5 show correlation (uncertainty coefficient) val
ues between each pair of attributes vary under three models(anatomy,
l diversity andRRQI ) on the data set ESGRO. We vary l from 2 to 7. We only include correlation values forl = 3,4, and 5 due to the space limitation. We use the attribute pair of Salary (S , oneQI attribute) and Occupation (O, the sensitive attribute) as an example (the column with bold fonts in Table 5)to study how correlations betweenQI andS change. The original uncertainty coefficient is 2.74 × 10^{2}.RRQI well achieves correlation preservation, i.e., 2.41, 2.27, and 2.17 (×10^{2}) forl =3, 4, and 5 respectively. Conversely, the uncertainty coefficient value under anatomy is only 0.20, 0.12, and 0.05 (×10^{2}) correspondingly.Forl diversity, it achieves zero correlation preservation whenl = 3, and 5 while it perfectly achieves correlation preservation whenl = 4, because the Salary attribute is generalized to “All” whenl = 3, and 5, while it is unchanged across allQI groups whenl = 4.l diversity in general cannot preserve correlation well, because it is intractable to predict whichQI attributes will be generalized in ldiversity.Fig. 7 shows the average value of uncertainty coefficients among attributes for anatomy, ldiversity and RRBoth on the data set ESGRO. As shown in Fig. 7a, randomization achieves the best correlation preservation between the sensitive attribute and quasiidentifiers across all privacy thresholds. Fig. 7b also clearly shows that randomization better preserves correlation among quasiidentifiers than ldiversity does. Please note that anatomy can best achieve the correlation among quasiidentifiers,since it does not change values of quasiidentifiers.
> C. Evaluation Summary
In summary, the evaluation shows that randomization can better preserve utility (in terms of distribution reconstruction,accuracy of aggregate query answering, and correlation among attributes) under the same privacy requirements. Utility loss is significantly smaller than that of generalization or anatomy approaches. Furthermore, the effectiveness of randomization can be further improved when more data are available. The evaluation also showed that the randomization approach can further improve the accuracy of the reconstructed distribution(and hence utility preservation) when more data are available,while generalization and anatomy approaches do not have this property.
VI. CONCLUSION
In this paper, we investigated attribute disclosure in the case of linking attacks. We compared randomization to other anonymization schemes (ldiversity and anatomy) in terms of utility preservation, with the same privacy protection requirements.Our experimental evaluations showed randomization significantly outperforms generalization, i.e., achieving better utility preservation, while yielding the same privacy protection.
There are several other avenues for future work. We aim to extend our research to handle multiple sensitive attributes. In this paper, we limit our scope, as attackers have no knowledge about the sensitive attribute of specific individuals in the population and/or the table. In practice, this may not be true, since the attacker may have instancelevel background knowledge(e.g., the attacker might know that the target individual does not have cancer; or the attacker might know complete information about some people in the table other than the target individual.)or partial demographic background knowledge about the distribution of sensitive and insensitive attributes in the population.Different forms of background knowledge have been studied in privacy preservation data publishing recently. For example, the formal language [22, 24] and maximum estimation constraints[43] are proposed to express background knowledge of attackers.We will investigate privacy disclosure under those background knowledge attacks. We should note that randomization may not outperform generalization in all cases, especially when some specific background knowledge is utilized in the adversary model, because generalization is truthful and the ambiguity is between true tuples (e.g., individual a and b are indistinguishable),whereas randomization can be regarded as adding noise,and noise can be removed when some background knowledge is known.
We will continue our study of further reducing computation overhead of the randomization approach. The execution time of randomization is usually slower than ldiversity and anatomy.For example, the execution time of our RRBoth approach on the Adult data sets was 10 times slower than for generalization approaches due to the heavy cost of determining the optimal distortion parameters with attribute disclosure constraints, especially when the attribute domains are large. One advantage of randomization is that the reconstruction accuracy increases when more data are available. Guo et al. [10] preliminarily conducted theoretical analysis on how the randomization process affects the accuracy of various measures (e.g., support, confidence,and lift) in market basket data analysis. In our future work, we will study the accuracy of reconstruction in terms of bias and variance of estimates in randomizing general microdata.
Finally, we are interested in comparing our randomization approaches to most recently developed generalization approaches(e.g., the accuracyconstrained ldiversification method in [14]).We are also interested in combining the derived optimal randomization schemes with the small domain randomization [38]to further improve utility preservation. It is our belief that extensive studies on comparing different privacy preservation data publishing approaches are crucial.
> Appendix 1. Proof of Property 1
We start with applying RR to the sensitive attribute, Pr( r =QI^{^}_{r}) = 1. Without loss of generality, we assume that S_{r} = 1. Combine other categories 0, 2, . . . , d_{s}  1 into a new categories, and still use 0 to denote the new category. To make the notation simple,we simply write
as π_{0} in this proof, then π_{1} =1 ？ π_{0}. Such adjustment does not change the reconstruction probability Pr( S？_{r} = S_{r} = 1'QI_{r}). After the adjustment, the randomization probabilities are given by
By definition, the posterior probabilities are given by
Combining (13), (14), and (15), we have
Taking the derivative with respect to ps, we have (16) is minimized when p_{s} = 1/d_{s} , and the minimal value is
Following similar strategies, we can prove the general case when we randomize both QI and S.
> Appendix 2. Proof of Result 2
LEMMA 3. If d( π^{^} , π) = '' π^{^} π''_{2}^{2} , we have E[d( π^{^}, π)] = trace(Σ), where Σ is the covariance matrix of π^{^} shown in (2).
PROOF. We know that π_{^} asymptotically follows the normal distribution N(π, Σ). Let Σ = XΛX^{T} be the eigendecomposition of Σ, where Λ = diag(λ_{1}, . . . , λ_{n}) and X^{T}X = I. Let η = X^{T} (π^{^} π), then η is normally distributed with E(η) = 0 and
Cov(η) = X^{T} Cov( π^{^} π)X = X^{T}ΣX = Λ.
Notice that Λ is a diagonal matrix, and hence Cov(η_{i}η_{j}) = 0 if i ≠_{ j}, and Var(η_{i}) = λ_{i}, i.e., η_{i} independently follows the normal distribution N(0, λ_{i}). Therefore, we have:
The last equality is due to the fact that
trace(Σ) = trace(XΛXT) = trace(ΛXTX) = trace(Λ).
We proved the lemma.
LEMMA 4. Let P_{i }be the randomization matrix specified in(1). When p_{i} ≠ 1/d_{i} , we have
where
1 is the column vector, whose cells all equal 1.PROOF. We can rewrite Pi as follows:
With
and the binomial inverse theorem[45], we can immediately get P_{i}^{1}, and ''P_{i}^{1 }_{F} ''^{2} can be directly calculated from P_{i}^{1} .
LEMMA 5. Let (P^{1})_{i} denote the ith column (or row) of P^{1}.Then, for any i = 1, 2, . . . , D, we have
PROOF. With Lemma 4, we observe that every row (or column)of P_{i}^{1}has the same components, except for a change of order. Since P^{1} = P_{i}^{1} ？ · · · ？ P_{s}^{1} , we can also conclude that all rows (or columns) of P^{1} have the same components,except for a change of order. Therefore, for all i’s,
Next, we prove the main result. With Lemma 3, when the distance is the squared Euclidean distance, to minimize E[d(π？ ,π)] is equivalent to minimize trace(Σ). With (2), we have
Notice that the constraint function is an increasing function of pi, the optimal solution must occur when the equality stands,and we have proved the result.

[Table 1.] 2 × 3 contingency tables for two variables gender (QI) disease(sensitive)

[Fig. 1.] Randomization based privacypreserving data publishing.

[Fig. 2.] Pr(Sr'QIr) vs. randomization parameters. QI: quasiidentifier.

[Table 2.] Description of the Adult data set used in the evaluation

[Table 3.] Randomization parameters pi for three cases of RR (data setEMGRW)

[Fig. 3.] Distances between and π for three scenarios of RR (data set E π MGRW). RR: randomized response QI: quasiidentifier.

[Fig. 4.] For RRBoth distances between π and π decrease as the data set size increases (data set EMGRW). RR: randomized response QI: quasiidentifier.

[Fig. 5.] Distances between and π for anatomy ldiversity and RRQI (π randomized responsequasiidentifier; data set ESGRO).

[Fig. 6.] Relative errors of queries for anatomy ldiversity and RRQI (randomized responsequasiidentifier; data set ESGRO).

[Table 4.] Variation of correlation (uncertainty coefficient) between pairs ofquasiidentifier (QI) attributes under different models (×102) (data set ESGRO)

[Table 5.] Variation of correlation (uncertainty coefficient) betweenquasiidentifier (QI) and S under different models (×102) (data set ESGRO)

[Fig. 7.] Average value of uncertainty coefficients among attributes for anatomy ldiversity and RRQI (randomized responsequasiidentifier; data setESGRO).