Limiting Attribute Disclosure in Randomization Based Microdata Release

  • ABSTRACT

    Privacy preserving microdata publication has received wide attention. In this paper, we investigate the randomization approach and focus on attribute disclosure under linking attacks. We give efficient solutions to determine optimal distortion parameters, such that we can maximize utility preservation while still satisfying privacy requirements. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects (reconstructed distributions, accuracy of answering queries, and preservation of correlations). Our empirical results show that randomization incurs significantly smaller utility loss.


  • KEYWORDS

    Algorithms, Privacy preservation, Randomization, Attribute disclosure, Linking attack

  • I. INTRODUCTION

    Privacy preserving microdata publication has received wide attention [1-5]. Each record corresponding to one individual has a number of attributes, which can be divided into the following three categories: 1) identity attributes (e.g., Name and SSN) whose values can uniquely identify an individual; 2) quasi-identifier (QI) attributes (e.g., demographic attributes such as ZIP code, age, and gender) whose values when taken together can potentially identify an individual; 3) sensitive attributes (e.g., disease and income) that indicate confidential information of individuals.

    Before microdata are released, identity attributes are often directly removed to preserve the privacy of individuals whose data are in the released table. However, the QI information may be used by attackers to link to other public data sets to get the private information of individuals. This is recognized as a linking attack in microdata publishing. Two types of information disclosure have been identified under linking attacks: identity disclosure and attribute disclosure [6]. Identity disclosure occurs if attackers can identify an individual from the released data. Attribute disclosure occurs when confidential information about an individual is revealed and can be attributed to the individual. Samarati and Sweeney [1] proposed the k-anonymity model and presented a generalization approach that divides tuples into QI-groups by transforming their QI-values into less specific forms, such that tuples in the same QI-group cannot be uniquely identified by attackers, to counter linking attacks based on quasi-identifiers. It was identified that k-anonymity is vulnerable to homogeneity and background knowledge attacks when data in a QI-group lack diversity in the sensitive attributes [2]. l-diversity [2], as well as subsequent models (e.g., t-closeness [7]), were proposed to protect against attribute disclosure. l-diversity requires that the sensitive attribute has at least l well-represented values in each QI-group of the generalized dataset.

    The randomization approach has also been adopted to publish microdata [8-12]. Instead of generalizing QI attribute values, the randomization approach distorts the original value to another domain value according to some distortion probabilities. The application of the randomization technique to prevent identity disclosure under linking attacks in data publishing was studied in [12]. That work focused on evaluating the risk of successfully linking a target individual to the index of his record given values of QI attributes. Our research moves one step further to investigate attribute disclosure under linking attacks. We focus on evaluating the risk of successfully predicting the sensitive attribute value of a target individual given his QI attribute values. We present a general randomization framework and give efficient solutions to determine optimal randomization parameters for both QI and sensitive attributes. Thus, we can maximize data utility, while satisfying privacy preservation requirements for sensitive attributes. We compare our randomization approach with other anonymization approaches within the framework (e.g., two representative approaches, l-diversity [2] and anatomy [5], are used in this paper). Our result shows the randomization approach can better recover the distribution of the original data set from the disguised one. Thus, intuitively, it might yield a disguised database with higher data utility than l-diversity generalization and anatomy.

    Our contributions are summarized as follows. We present a systematic study of the randomization method in preventing attribute disclosure under linking attacks. We propose the use of a specific randomization model and present an efficient solution to derive distortion parameters that satisfy privacy preservation requirements, while maximizing data utility. We propose a general framework and present a uniform definition of attribute disclosure which is compatible with both randomization and generalization models. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects (reconstructed distributions, accuracy of answering queries, and preservation of correlations). Our empirical results show that randomization incurs significantly smaller utility loss.

    The remainder of this paper is organized as follows. In Section II, we discuss closely related work on group based anonymization approaches and randomization approaches in privacy-preserving data publishing. In Section III, we present background on randomization based distortions, including analysis and attacks on the randomized data. In Section IV, we first quantify attribute disclosure under linking attacks and then show our theoretical results on maximizing utility with privacy constraints. In Section V, we conduct empirical evaluations and compare the randomization based distortion with two representative group based anonymization approaches (l-diversity [2] and anatomy [5]). We conclude our work in Section VI.

    II. RELATED WORK

      >  A. Group Based Anonymization

    k-anonymity was proposed [1] to counter linking attacks for data publishing. The method generalizes the values of quasi-identifier attributes to less-specific ones, so that each individual cannot be distinguished from at least k - 1 other individuals based on quasi-identifier information. There has been much study in designing efficient algorithms for k-anonymity using generalization and suppression techniques [2, 3, 5, 13, 14].

    k-anonymity provides protection against identity disclosure, while it contributes little against attribute disclosure. l-diversity [2] was proposed to counter attribute disclosure. A table is l-diverse if, in each QI-group, at most 1/l of the tuples possess the most frequent sensitive value. Thus, an individual can be linked to his sensitive value correctly with probability at most 1/l. Similarly, the notion of t-closeness [7] requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold t). The anatomy method was proposed to protect against attribute disclosure by breaking the link between quasi-identifiers and sensitive attributes [5]. Anatomy releases all the quasi-identifier and sensitive values directly in two separate tables: a quasi-identifier table (QIT) and a sensitive table (ST). The authors presented an algorithm to compute anatomized tables that satisfy the l-diversity requirement, while minimizing the error of reconstructing the data.

    Kifer and Gehrke [15] investigated the problem of injecting additional information into k-anonymous and l-diverse tables to improve the utility of published data. Koudas et al. [16] studied the problem of preserving privacy through sensitive attribute permutation and generalization with the addition of fake sensitive attribute values. Narayanan and Shmatikov [17] presented a de-anonymization methodology for sparse multi-dimensional microdata, such as individual transactions, preferences, and so on. The de-anonymization algorithm could be used by attackers against data sets containing anonymous high-dimensional data. Brickell and Shmatikov [18] compared the privacy loss (defined by certain kinds of information learned by attackers) with the utility gain (defined as the same kinds of information learned by analysts) caused by data anonymization and concluded that “even modest privacy gains require almost complete destruction of the data mining utility” in generalization methods. Li and Li [19] proposed the use of the worst-case privacy loss rather than the average privacy loss among all individuals adopted in [18] to measure privacy loss. They further presented a methodology to evaluate the privacy-utility tradeoff, where privacy loss is quantified by the adversary’s knowledge gain about the sensitive values of specific individuals with the trivially-anonymized data as the baseline, and utility loss is measured by the information loss about the sensitive values of large populations with the original data as the baseline. In our paper, we use 1/l to bound the attribute disclosure for all records and use the difference of distributions between the original data and the reconstructed data to quantify utility loss. We also examine the utility loss from the perspective of answering aggregate queries and preserving correlations between the sensitive attribute and QI attributes.

    In this paper, we focus on the randomization model in resisting attribute disclosure in data publishing. The proposed randomization model is compared to the traditional l-diversity [2] and anatomy [5] models. Recently, Ghinita et al. [14] proposed a framework to solve the privacy-constrained and accuracy-constrained data anonymization problems under the k-anonymity and l-diversity models. They developed efficient heuristics to solve the one-dimensional problems in linear time and generalized solutions to multi-dimensional attributes using space-mapping techniques. Their empirical evaluation results showed that privacy/accuracy-constrained methods outperform existing generalization approaches in terms of execution time and information loss. We would also point out that researchers have developed various anonymity models, including p-sensitive k-anonymity [20], (α, k)-anonymity [21], (k, e)-anonymity [4], (c, k)-safety [22], m-confidentiality [23], privacy skyline [24], δ-presence [25], (ε, m)-anonymity [26], and (t, λ)-uniqueness [27]. Detailed comparisons to these works are beyond the scope of this paper.

      >  B. Randomization

    The first randomization model, called Randomized Response (RR), was proposed by Warner [28] in 1965. The model deals with one dichotomous attribute, i.e., every person in the population belongs either to a sensitive group A or to its complement Aᶜ. The problem is to estimate πA, the unknown proportion of population members in group A. Each respondent is provided with a randomization device by which the respondent chooses one of the following two questions, “Do you belong to A?” or “Do you belong to Aᶜ?”, with respective probabilities p and 1 - p, and then replies yes or no to the question chosen. The technique provides response confidentiality and increases respondents’ willingness to answer sensitive questions, because no one but the respondent knows to which question the answer pertains. It has been extended to the field of privacy preserving data mining [29].
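Warner's scheme can be simulated in a few lines. The sketch below uses assumed parameters (πA = 0.3, p = 0.8, and a hypothetical sample size); the function names are ours, not from any cited implementation.

```python
import random

def warner_answer(in_group, p, rng):
    """One respondent under Warner's RR: with probability p the chosen
    question is "Do you belong to A?", otherwise it is "Do you belong to
    A's complement?"; the reply is truthful for the chosen question."""
    return in_group if rng.random() < p else not in_group

def estimate_pi_a(answers, p):
    """Solve Pr(yes) = p*pi_A + (1-p)*(1-pi_A) for pi_A (requires p != 1/2)."""
    yes_rate = sum(answers) / len(answers)
    return (yes_rate - (1 - p)) / (2 * p - 1)

rng = random.Random(7)
true_pi_a, p, n = 0.3, 0.8, 200_000
answers = [warner_answer(rng.random() < true_pi_a, p, rng) for _ in range(n)]
pi_a_hat = estimate_pi_a(answers, p)
```

With a large sample, the moment estimator recovers πA closely even though no individual answer reveals group membership.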

    Aggarwal and Yu [30] provided a survey of randomization models for privacy preservation data mining, including various reconstruction methods for randomization, adversarial attacks, and the optimality and utility of randomization. Rizvi and Haritsa [9] developed the MASK technique to preserve privacy for frequent item set mining. Agrawal and Haritsa [31] presented a general framework of random perturbation in privacy preservation data mining. Du and Zhan [8] studied the use of a randomized response technique to build decision tree classifiers. Guo et al. [10] investigated data utility in terms of the accuracy of reconstructed measures in privacy preservation market basket data analysis. Aggarwal [32] proposed adding noise from a specified distribution to produce a probabilistic model of k-anonymity. Zhu and Liu [33] investigated the construction of optimal randomization schemes for privacy preservation density estimation and proposed a general framework for randomization using mixture models. Rebollo-Monedero et al. [34] defined a privacy measure similar to t-closeness [7] and formalized the privacy-utility tradeoff from the information theory perspective. The resulting solution turns out to be the post-randomization (PRAM) method [35] in the discrete case and a form of random noise addition in the general case. They also proved that the optimal perturbation is in general randomized, rather than deterministic, in the continuous case.

    Huang and Du [36] studied the search for optimal distortion parameters to balance privacy and utility. They developed an evolutionary multi-objective optimization method to find optimal distortion matrices when applying the RR technique on a single attribute. The complexity of the algorithm is O(((N_Q + N_V)^3 + n^2) L), where N_Q is the population size, N_V is the archive size, n is the number of columns in the RR matrix, and L is the maximum number of iterations. However, there is no guarantee that the derived matrices are optimal in the entire search space, and the performance tends to be sacrificed for multiple attributes due to the intractable size of the search space. Similarly, Xiao et al. [37] investigated optimal random perturbation at multiple privacy levels.

    In our paper, we focus on a specific randomization model (i.e., perturbation retains the original value of an attribute A_i with probability p_i and replaces the original value with a random value from the domain of A_i; see Equation 1) and derive an efficient solution to determine the optimal randomization parameters under linking attacks. Chaytor and Wang [38] also applied this simple randomization model to randomize a sensitive value only within a subset of the entire domain. Their algorithm, called Small Domain Randomization, can effectively derive optimal partitions, such that randomizing the sensitive attribute values of records (with small domains) in each partition separately retains more data utility at the same level of privacy than randomizing the whole data set over the entire domain. Our work focuses on attribute disclosure when applying full domain randomization to both the QI attributes and the sensitive attribute S. The derived optimal randomization parameter values under privacy constraints can be incorporated in small domain randomization.

    III. PRELIMINARIES

    Dataset T contains N records and m + 1 categorical attributes: A1, A2, . . . , Am, and S. We use QI = {A1, . . . , Am} to denote the set of quasi-identifier attributes (e.g., demographic) whose values may be known to the attacker for a given individual and use S to denote one sensitive attribute whose value should not be associated with an individual by attackers. Generally, T may also contain other attributes that are neither sensitive nor quasi-identifying. Those attributes are usually kept unchanged in the released data. We exclude them from our setting, since they do not incur privacy disclosure risk or utility loss. All discussions in this paper assume a single sensitive attribute; they can be generalized to multiple sensitive attributes.

    Attribute A_i has d_i categories denoted by 0, 1, . . . , d_i - 1. We use Ω_i = {0, 1, . . . , d_i - 1} to denote the domain of A_i, and Ω_QI = Ω_1 × · · · × Ω_m is the domain of the quasi-identifiers. Similarly, attribute S has d_s categories denoted by 0, 1, . . . , d_s - 1. We use Ω_s = {0, 1, . . . , d_s - 1} to denote the domain of S. The r-th record R_r is denoted by (A_1^r, A_2^r, . . . , A_m^r, S^r), or simply (QI_r, S_r). Let D = d_s ∏_{i=1}^{m} d_i be the total number of cells in the contingency table.

    Let π_{i_1, …, i_m, i_s} denote the true proportion corresponding to the categorical combination (A_1 = i_1, · · · , A_m = i_m, S = i_s). Let π be the column vector with D elements π_{i_1, …, i_m, i_s} arranged in a fixed order. Table 1 shows one contingency table example for a data set with one QI attribute (Gender, d_1 = 2) and one sensitive attribute S (Disease, d_s = 3). Table 1a shows the original contingency table, where π = (π_00, π_01, π_02, π_10, π_11, π_12)′ corresponds to a fixed order of the cell entries π_ij in the 2 × 3 contingency table. π_10 denotes the proportion of records with Female and Cancer. The column sum π_{+0} represents the proportion of records with Cancer across both genders. Note that contingency tables are widely used in statistics to record and analyze the relationship between categorical attributes. Results of most data mining tasks (e.g., clustering, decision tree learning, and association rule mining), as well as aggregate queries, are solely determined by the contingency table. That is, analysis on contingency tables is equivalent to analysis on the original categorical data. Table 1b shows one contingency table instance derived from a data set with 100 tuples. We will use this contingency table as an example to illustrate properties of link disclosure.
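To make the cell-proportion notation concrete, the sketch below builds a contingency table from categorical records. The record counts are hypothetical, chosen only so that the Female-and-Cancer cell matches the π_10 = 0.12 used later in the running example; everything else is illustrative.

```python
from collections import Counter
from itertools import product

def contingency(records, qi_domain, s_domain):
    """Cell proportions pi_{i,s} for every (QI, S) combination, in a fixed order."""
    counts = Counter(records)
    n = len(records)
    return {cell: counts[cell] / n for cell in product(qi_domain, s_domain)}

# Hypothetical (Gender, Disease) records; Gender 1 = Female, Disease 0 = Cancer.
records = ([(0, 0)] * 20 + [(0, 1)] * 30 + [(0, 2)] * 10 +
           [(1, 0)] * 12 + [(1, 1)] * 18 + [(1, 2)] * 10)
pi = contingency(records, qi_domain=(0, 1), s_domain=(0, 1, 2))
```

Here pi[(1, 0)] recovers π_10 = 0.12, and the cell proportions sum to 1, matching the contingency-table view above.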

      >  A. Distortion Procedure

    We use the left part of Fig. 1 to illustrate the process of privacy preserving data publishing. For each record R_j, the data owner independently randomizes attribute A_i using the distortion matrix P_i. Specifically, for attribute A_i (or S) with d_i categories, the randomization process changes a record belonging to the v-th category (v = 0, . . . , d_i - 1) to the u-th category with probability p_uv^(i): Pr(Ã_i = u | A_i = v) = p_uv^(i). Let P_i = [p_uv^(i)]_{d_i × d_i}; we call P_i (or P_s) the distortion matrix for A_i (or S). Naturally, the column sums of P_i are equal to 1. The original database T is changed to T̃, and then both the randomized data set and the distortion matrices are published. The randomization matrices indicate the magnitude of the randomization; this can help data analysts estimate the original data distribution.
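The per-attribute distortion step can be sketched as follows, assuming the distortion matrix is given as a column-stochastic list of lists with P[u][v] = Pr(Ã_i = u | A_i = v); the function name and parameter values are illustrative.

```python
import random

def randomize_column(values, P, rng):
    """Independently distort each value v to category u with probability
    P[u][v], where each column of P sums to 1."""
    d = len(P)
    out = []
    for v in values:
        r, acc = rng.random(), 0.0
        u = d - 1                      # fallback guards against float rounding
        for cand in range(d):
            acc += P[cand][v]
            if r < acc:
                u = cand
                break
        out.append(u)
    return out

# Keep a value with probability 0.6; move to either other category w.p. 0.2.
P = [[0.6, 0.2, 0.2],
     [0.2, 0.6, 0.2],
     [0.2, 0.2, 0.6]]
rng = random.Random(1)
distorted = randomize_column([0] * 50_000, P, rng)
kept = distorted.count(0) / len(distorted)
```

Over many records, the empirical fraction of unchanged values approaches the diagonal probability.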

    Let λ denote the contingency table of the randomized data T̃. We arrange λ into a column vector in the same order as π. Table 1c shows one example of the randomized contingency table. The randomized contingency table has a close relationship to the original contingency table and the randomization matrices: E(λ) = Pπ, where P = P_1 ⊗ · · · ⊗ P_m ⊗ P_s, and ⊗ stands for the Kronecker product. The Kronecker product is an operation on two matrices, an m-by-n matrix A and a p-by-q matrix B, resulting in an mp-by-nq block matrix.
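The identity E(λ) = Pπ with P = P_1 ⊗ · · · ⊗ P_m ⊗ P_s can be checked numerically for the 2 × 3 example; the distortion matrices and cell proportions below are hypothetical.

```python
import numpy as np

# Column-stochastic distortion matrices for Gender (2 cats) and Disease (3 cats).
P1 = np.array([[0.9, 0.1],
               [0.1, 0.9]])
Ps = np.array([[0.6, 0.2, 0.2],
               [0.2, 0.6, 0.2],
               [0.2, 0.2, 0.6]])
# pi in the fixed order (g, s) = (0,0),(0,1),(0,2),(1,0),(1,1),(1,2).
pi = np.array([0.20, 0.30, 0.10, 0.12, 0.18, 0.10])

lam = np.kron(P1, Ps) @ pi            # E(lambda) = (P1 ⊗ Ps) pi

# The same expectation computed cell by cell, without the Kronecker product:
pi_t = pi.reshape(2, 3)
lam_direct = np.array([[sum(P1[u, v] * Ps[us, vs] * pi_t[v, vs]
                            for v in range(2) for vs in range(3))
                        for us in range(3)]
                       for u in range(2)]).ravel()
```

The Kronecker form and the cell-by-cell form agree, and the expected randomized table is again a distribution.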

    Distortion matrices determine the privacy and utility of the randomized data. How to find optimal distortion parameters with privacy or utility constraints has remained a challenging problem [39]. In this paper, we limit the choice of randomization parameters for each QI attribute Ai (and sensitive attribute S) as:

    p_uv^(i) = p_i if u = v, and p_uv^(i) = q_i = (1 - p_i)/(d_i - 1) if u ≠ v.   (1)

    That is, for each attribute A_i, all categories have the same probability p_i of remaining unchanged, and the same probability q_i of being distorted to any particular different category. With this restricted choice, we can derive an efficient algorithm (with an explicit formula) to determine the optimal randomization parameters (as shown in Section IV-B).
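A distortion matrix of this restricted form is determined by a single parameter p_i. A minimal constructor (our naming, consistent with Equation 1):

```python
import numpy as np

def rr_matrix(d, p):
    """Form-(1) distortion matrix: keep a category with probability p,
    move to each of the other d-1 categories with probability
    q = (1-p)/(d-1), so every column sums to 1."""
    q = (1.0 - p) / (d - 1)
    return np.full((d, d), q) + (p - q) * np.eye(d)

P = rr_matrix(3, 0.6)
```

Setting p = 1/d makes every entry 1/d, i.e., the distorted value carries no information about the original one.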

      >  B. Analysis on the Randomized Data

    One advantage of the randomization procedure is that the pure random process allows data analysts to estimate the original data distribution based on the released data and the randomization parameters. The right part of Fig. 1 shows how data analysts estimate the original data distribution. With the randomized data T̃ and its contingency table λ, the unbiased estimate of π is given by π̂ = P⁻¹λ, and the covariance matrix of the estimator is

    Σ = (1/N) P⁻¹ [ (Pπ)_δ - (Pπ)(Pπ)′ ] (P⁻¹)′   (2)

    where (Pπ)_δ stands for the diagonal matrix whose diagonal entries are the components of Pπ.

    Thus, data analysts derive an estimate of the original data distribution (in terms of the contingency table) without disclosing the individual information of each record. We choose the accuracy of the reconstructed distribution as the target utility function in Section IV-B, because most data mining applications are based on the probability distribution of the data.
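The analyst's reconstruction step is a single linear solve. The sketch below round-trips the expected randomized table back to the original proportions (matrices and proportions hypothetical; with a finite sample, λ̂ replaces E(λ) and π̂ is only unbiased, not exact).

```python
import numpy as np

def reconstruct(lam, P):
    """Unbiased estimate pi_hat = P^{-1} lam of the original contingency table."""
    return np.linalg.solve(P, lam)

def rr_matrix(d, p):
    """Form-(1) distortion matrix with kept-probability p."""
    q = (1.0 - p) / (d - 1)
    return np.full((d, d), q) + (p - q) * np.eye(d)

P = np.kron(rr_matrix(2, 0.9), rr_matrix(3, 0.6))   # P1 ⊗ Ps
pi = np.array([0.20, 0.30, 0.10, 0.12, 0.18, 0.10])
lam = P @ pi                                        # expected randomized table
pi_hat = reconstruct(lam, P)
```

Using `np.linalg.solve` avoids forming P⁻¹ explicitly while computing the same estimate.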

      >  C. Attacks on Randomized Data

    Let X be an attribute or a subset of attributes in the data set T with domain Ω_X, and let X̃ be the randomized value of X in T̃. Given the randomization process and parameters, it is not reasonable for attackers to regard the observed value as the true value of X. Instead, attackers can try to estimate the original value based on the observed data and the released randomization parameters. Let X̂ denote the attacker's estimate of the original value of X. Any value in Ω_X is possible due to the randomization procedure. We assume that the attacker is able to calculate the posterior probability of content in the data set and takes the following probabilistic strategy: for any μ, ν ∈ Ω_X,

    Pr(X̂ = μ | X̃ = ν) = Pr(X = μ | X̃ = ν),   (3)

    where Pr(X = μ | X̃ = ν) denotes the attacker’s posterior belief that the original value is X = μ when he observes X̃ = ν. With Bayes’ theorem, it can be calculated by:

    Pr(X = μ | X̃ = ν) = Pr(X̃ = ν | X = μ) Pr(X = μ) / Σ_{w ∈ Ω_X} Pr(X̃ = ν | X = w) Pr(X = w).   (4)
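Bayes' theorem here is a one-line matrix computation: given the distortion matrix and the prior cell proportions, the full posterior table can be computed at once. This is a sketch with hypothetical inputs; `posterior` is our name.

```python
import numpy as np

def posterior(P, pi):
    """Posterior table Q with Q[mu, nu] = Pr(X = mu | X~ = nu), computed by
    Bayes' theorem from P[nu, mu] = Pr(X~ = nu | X = mu) and prior pi."""
    joint = P * pi                               # joint[nu, mu] = Pr(X~=nu, X=mu)
    return (joint / joint.sum(axis=1, keepdims=True)).T

P = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])
pi = np.array([0.5, 0.3, 0.2])
Q = posterior(P, pi)
```

Each column of Q is a probability distribution over the possible original values, and with no randomization (P the identity) the posterior collapses to certainty.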

    The following lemma gives the accuracy of the attacker’s estimation.

    LEMMA 1. Suppose attackers adopt the probabilistic strategy specified in (3) to estimate the data. The probability that attackers accurately estimate the original value of X is given by:

    Pr(X̂ = X = μ) = Σ_{ν ∈ Ω_X} Pr(X̃ = ν | X = μ) Pr(X = μ | X̃ = ν).   (5)

    PROOF. For a particular observed value X̃ = ν ∈ Ω_X, Pr(X̂ = X = μ, X̃ = ν) = Pr(X̃ = ν | X = μ) Pr(X = μ | X̃ = ν). Then, with the law of total probability, we have

    Pr(X̂ = X = μ) = Σ_{ν ∈ Ω_X} Pr(X̃ = ν | X = μ) Pr(X = μ | X̃ = ν).

    The probability of the attackers’ correct estimation is also called the reconstruction probability, Pr(X̂ = X = μ), in [9, 12]. It captures the round trip of going from the original content to the distorted one and then returning to estimate the value of the original content. That is, it indicates how much information of the original data is preserved after randomization. To keep the notation concise, we adopt the notation Pr(X̂ = X = μ) for the reconstruction probability when evaluating the risk of sensitive attribute disclosure.
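Lemma 1 translates directly into code. The sanity checks match intuition: with no randomization the reconstruction probability is 1, and with a uniform distortion matrix it collapses to the prior Pr(X = μ). Names and inputs are illustrative.

```python
import numpy as np

def reconstruction_prob(P, pi, mu):
    """Pr(X^ = X = mu) per Lemma 1: sum over observed nu of
    Pr(X~ = nu | X = mu) * Pr(X = mu | X~ = nu)."""
    lam = P @ pi                               # marginal Pr(X~ = nu)
    post_mu = P[:, mu] * pi[mu] / lam          # Bayes: Pr(X = mu | X~ = nu)
    return float(P[:, mu] @ post_mu)

pi = np.array([0.5, 0.3, 0.2])
no_rr = reconstruction_prob(np.eye(3), pi, mu=0)               # no randomization
uniform = reconstruction_prob(np.full((3, 3), 1 / 3), pi, mu=0)  # full randomization
```

Between these two extremes, the reconstruction probability decreases monotonically as the distortion matrix moves toward uniform.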

    IV. ATTRIBUTE DISCLOSURE UNDER LINKING ATTACKS

    We measure privacy in terms of the attribute disclosure risk, whose formal definition is given as follows:

    DEFINITION 1. The attribute disclosure risk under linking attacks is defined to be the probability that the attacker predicts S_r successfully given QI_r of a target individual r, denoted as Pr(S_r | QI_r).

    We need to quantify the background knowledge of attackers to derive the disclosure probability Pr(S_r | QI_r). We make the following standard assumptions about the attacker's background knowledge in this paper. We assume that the attacker has access to the published data set T̃ and knows that T̃ is a randomized version of some base table T. The attacker knows the domain of each attribute of T. We also assume that the attacker can obtain the QI-values of the target individual (e.g., Alice) from some public database or background knowledge and knows that the target individual is definitely contained in the published data. However, he has no knowledge of which record in the published data belongs to the target individual. Finally, we assume that the distortion matrices P_i are available to the attacker, because they are necessary for data miners to conduct analysis. The algorithm of l-diversity in [2] preserves privacy by generalizing the QI attributes to form QI-groups. Individuals in a group are linked to any sensitive value with probability at most 1/l, i.e., Pr(S_r | QI_r) ≤ 1/l. However, the randomization based approach achieves the privacy protection probabilistically. In the following subsection, we show how to quantify the attribute disclosure risk in the randomization settings.

      >  A. Quantifying Attribute Disclosure

    When there is no randomization applied, for those records with their quasi-identifiers equal to QIr, the attacker simply regards every record as having the same probability to represent individual r. The risk of sensitive attribute disclosure is equal to

    Pr(S_r | QI_r) = π_{QI_r, S_r} / π_{QI_r},

    because a proportion π_{QI_r} of the records fall within the group of QI_r, and a proportion π_{QI_r, S_r} of the records additionally have the sensitive value equal to S_r. This case corresponds to the worst case of the attribute disclosure risk. When randomization is applied, the attribute disclosure risk is reduced, because the randomization increases the attacker’s uncertainty.
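The worst-case risk above is just a ratio of cell proportions. A small sketch with hypothetical proportions (only the Female-and-Cancer cell 0.12 is taken from the running example):

```python
def baseline_risk(pi, qi, s):
    """Attribute disclosure risk without randomization:
    pi_{QI_r, S_r} / pi_{QI_r}, the worst case."""
    group_total = sum(v for (g, _), v in pi.items() if g == qi)
    return pi[(qi, s)] / group_total

# Hypothetical 2x3 contingency table; Gender 1 = Female, Disease 0 = Cancer.
pi = {(0, 0): 0.20, (0, 1): 0.30, (0, 2): 0.10,
      (1, 0): 0.12, (1, 1): 0.18, (1, 2): 0.10}
risk = baseline_risk(pi, qi=1, s=0)
```

With these proportions the Female group has total mass 0.40, so an attacker linking Alice to the Female group predicts Cancer correctly with probability 0.12 / 0.40 = 0.3.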

    1) Randomize S only (RR-S): When data owners only apply randomization to the sensitive attribute, for each record within the group of QI_r the attacker takes a guess at its sensitive value using the observed sensitive value and the posterior probability in (4). According to (5), the probability of a correct estimation is Pr(Ŝ_r = S_r | QI_r); the risk of sensitive attribute disclosure is then

    image

    2) Randomize QI only (RR-QI): Similarly to RR-S, when data owners only apply randomization to the quasi-identifiers, the probability of correctly reconstructing QI_r is given by Pr(QÎ_r = QI_r), and hence the risk of sensitive attribute disclosure is

    image

    3) Randomize QI and S (RR-Both): When data owners apply randomization to both QI and S, the attacker first needs the values of the quasi-identifier attributes to be correctly reconstructed; the probability is given by Pr(QÎ_r = QI_r). Second, the attacker needs the value of the sensitive attribute to be correctly reconstructed, given correctly reconstructed quasi-identifier values. We summarize the risk of sensitive attribute disclosure for RR-Both, as well as RR-S and RR-QI, in the following result, which gives the general calculation of the attribute disclosure risk in the randomization settings.

    RESULT 1. Assume an individual r has quasi-identifier QI_r = α = (i_1, i_2, . . . , i_m), i_k ∈ Ω_k, and sensitive attribute S_r = u, u ∈ Ω_s. The probability of successfully predicting the sensitive attribute value S_r of the target individual r given his quasi-identifier values QI_r is:

    image

    We give the formal expressions of the two reconstruction probabilities needed to calculate Pr(S_r | QI_r) in (6). The reconstruction probability of the quasi-identifier is given by:

    Pr(QÎ_r = QI_r) = Σ_{β ∈ Ω_QI} [Pr(QĨ_r = β | QI_r = α)]² Pr(QI_r = α) / Pr(QĨ_r = β),   (7)

    where Pr(QĨ_r = β | QI_r = α) = ∏_{k=1}^{m} p^{(k)}_{j_k i_k} is the probability that α = (i_1, i_2, . . . , i_m) is distorted to β = (j_1, j_2, . . . , j_m).

    The reconstruction probability of the sensitive attribute S for the target individual with the quasi-identifier QIr is given by:

    image

    We are interested in when the attribute disclosure risk reaches its minimum. We show our results in the following property; the proof is included in the Appendix.

    PROPERTY 1. Given QI_r of individual r:

    for RR-S, Pr(S_r | QI_r) is minimized when p_s = 1/d_s;

    for RR-QI, Pr(S_r | QI_r) is minimized when p_i = 1/d_i for all i = 1, . . . , m;

    for RR-Both, Pr(S_r | QI_r) is minimized when p_s = 1/d_s and p_i = 1/d_i for all i = 1, . . . , m.

    Example. We use the instance shown in Table 1b to illustrate this property. For an individual r with (QI_r = Female, S_r = Cancer), we randomize S (Disease) with p_s and QI (Gender) with p_G independently. Fig. 2 shows how the attribute disclosure risk varies as we apply different randomization parameters. We can see that Pr(S_r | QI_r) reaches the maximum (i.e.,

    image

    ) when no randomization is introduced. Fig. 2a shows the scenario where we only randomize S. We can see that min

    image

    when p_s = 1/d_s = 1/3.

    Fig. 2b shows the scenario where we only randomize QI. We can see that min Pr(S_r | QI_r) = π_10 = 0.12 when p_G = 1/d_G = 1/2.

    Fig. 2c shows the case where randomization is applied to both QI and S. Pr(S_r | QI_r) reaches the minimum only when both p_s = 1/3 and p_G = 1/2, and min

    image

    Computational Cost. The main computational cost in (6) comes from calculating Pr(QÎ_r = QI_r). Let P_QI be the distortion matrix on the quasi-identifiers: P_QI = P_1 ⊗ · · · ⊗ P_m; let π_QI be the contingency table on all quasi-identifiers arranged in a column vector, and let λ_QI denote the expected QI contingency table of the randomized data: λ_QI = P_QI π_QI. Then, the denominator in (7) is exactly the cell of λ_QI corresponding to β. Let η denote the column vector of the reconstruction probabilities of the quasi-identifiers, arranged in the same order as π_QI. We can further express (7) in matrix form:

    η = π_QI × [ (P_1² ⊗ · · · ⊗ P_m²)′ λ_QI⁻¹ ]   (9)

    where × denotes component-wise multiplication, P_i² is the component-wise square of P_i, and λ_QI⁻¹ is the component-wise inverse of λ_QI.

    In (9), we need to repeatedly calculate (P_1 ⊗ · · · ⊗ P_m)x, where x denotes a column vector. Assume we use the naive algorithm for all matrix multiplications. Calculating (P_1 ⊗ · · · ⊗ P_m)x directly results in time and storage complexity of O(∏_i d_i²). The main storage cost comes from storing the matrix P_QI. The following lemma allows us to reduce the cost of this computation:

    LEMMA 2. Let A, B, and X be matrices of size n × n, m × m, and m × n, respectively. Then

    (A ⊗ B) vec(X) = vec(BXAᵀ),

    where vec(X) denotes the vectorization of the matrix X formed by stacking the columns of X into a single column vector.
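Lemma 2 is easy to verify numerically; note that vec(·) stacks columns, which in NumPy is Fortran-order flattening. The matrices below are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
A = rng.random((n, n))
B = rng.random((m, m))
X = rng.random((m, n))

def vec(M):
    """Stack the columns of M into a single column vector."""
    return M.reshape(-1, order="F")

lhs = np.kron(A, B) @ vec(X)        # left-hand side of Lemma 2
rhs = vec(B @ X @ A.T)              # right-hand side of Lemma 2
```

The right-hand side costs two small matrix products instead of one product with an nm-by-nm matrix, which is the source of the savings below.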

    Applying Lemma 2 recursively, we can reduce the time complexity to O((Σ_i d_i) ∏_i d_i). The storage complexity is also reduced to O(∏_i d_i + Σ_i d_i²); this is mainly used to store the contingency table.
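The recursive application of Lemma 2 amounts to applying each P_i along its own tensor axis, never materializing P_QI. The sketch below is our implementation under the assumption of C-order indexing, so the first attribute is the major index, matching np.kron.

```python
import numpy as np

def kron_matvec(mats, x):
    """Compute (P_1 ⊗ ... ⊗ P_m) x by applying one factor per tensor axis,
    in O((sum_i d_i) * prod_i d_i) time instead of O((prod_i d_i)^2)."""
    dims = [P.shape[0] for P in mats]
    v = np.asarray(x, dtype=float)
    for i, P in enumerate(mats):
        v = v.reshape(dims)
        v = np.tensordot(P, v, axes=([1], [i]))  # contract axis i with P
        v = np.moveaxis(v, 0, i).reshape(-1)     # restore axis order, flatten
    return v

rng = np.random.default_rng(1)
mats = [rng.random((d, d)) for d in (2, 3, 2)]
x = rng.random(12)
fast = kron_matvec(mats, x)
slow = np.kron(np.kron(mats[0], mats[1]), mats[2]) @ x
```

For realistic QI domains the difference is dramatic: the naive product needs a (∏ d_i) × (∏ d_i) matrix, while the recursive form touches only the factor matrices and one vector.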

      >  B. Maximizing Utility with Privacy Constraints

    The ultimate goal of publishing data is to maximize utility while simultaneously minimizing the risk of attribute disclosure. The utility of any data set, whether randomized or not, is innately dependent on the tasks one may perform on it. Without a workload context, it is difficult to say whether a data set is useful or not. This motivates us to focus on the data distribution when evaluating the utility of a database, since many data mining applications are based on the probability distribution of the data.

    PROBLEM 1. Determine p_1, . . . , p_m and p_s to minimize E[d(π̂, π)], subject to Pr(S_r | QI_r) ≤ τ for every individual r,

    where E[d(·)] denotes the expectation of d(·), and d(π̂, π) denotes a certain distance between π̂ and π.

    We can set the privacy threshold τ to match the privacy requirement of l-diversity (i.e., τ = 1/l) to compare the performance of the different disguise schemes. That is, we determine the optimal randomization parameters (p_1, p_2, . . . , p_m, and p_s) that maximize the utility, while ensuring that the sensitive value of any individual in the data set cannot be correctly inferred by an attacker with probability greater than 1/l. A larger l leads to stronger privacy protection. In general, privacy constraints may be flexible. For example, different individuals may have different concerns about their privacy, so we can set different thresholds for Pr(S_r | QI_r).

    Problem 1 is a nonlinear optimization problem. In general, we can solve it using optimization packages (e.g., the trust region algorithm [40]). In Section IV-A, we discussed how to efficiently calculate the attribute disclosure risk of target individuals (shown as constraints in Problem 1). Next, we show how to efficiently calculate E[‖π̂ - π‖₂²], the expected squared Euclidean distance between the original data and the estimated one.

    RESULT 2. When d(π̂, π) is the squared Euclidean distance, Problem 1 is equivalent to: determine p_i, i = 1, · · · , m, and p_s to

    image

    When d(π̂, π) is the squared Euclidean distance, by Lemma 3 (shown in the Appendix), minimizing E[d(π̂, π)] is equivalent to minimizing trace(Σ), where Σ is the covariance matrix of the estimator of the cell values in the contingency table (shown in Equation 2). Calculating trace(Σ) still involves a high computational cost. However, when P_i has the specific form shown in (1), we can further reduce the problem to minimizing ∏_i ‖P_i⁻¹‖_F², and with Lemma 4, we have

    image

    where 1 is the column vector whose cells are all equal to 1.
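The per-attribute objective ‖P_i⁻¹‖_F² can be evaluated directly for form-(1) matrices. The sketch below (our helper, not from the paper) shows the utility-privacy tension: the term equals d_i at p_i = 1 (no distortion) and blows up as p_i approaches 1/d_i (full distortion).

```python
import numpy as np

def inv_frob_sq(d, p):
    """|| P_i^{-1} ||_F^2 for a form-(1) distortion matrix with
    kept-probability p (p != 1/d, so the matrix is invertible)."""
    q = (1.0 - p) / (d - 1)
    P = np.full((d, d), q) + (p - q) * np.eye(d)
    return float(np.sum(np.linalg.inv(P) ** 2))
```

For d = 3, inv_frob_sq(3, 1.0) = 3 (identity matrix) while inv_frob_sq(3, 0.6) = 13.5, so stronger distortion inflates the estimator variance that Result 2 minimizes.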

    V. EMPIRICAL EVALUATION

We ran our experiments on the Adult Database from the UCI machine learning repository [41]. The same database has been used in previous work on k-anonymity, l-diversity, t-closeness, and anatomy [2-5]. The Adult Database contains 45,222 tuples from US census data and 14 attributes. Table 2 summarizes the data, including the attributes we used, the number of distinct values for each attribute, and the attribute types adopted in the evaluation.

A good publication method is expected to preserve both privacy and data utility. We set different l values as privacy disclosure thresholds. To quantify utility, we adopt the following standard distance measures to compare the distributions of the original and reconstructed data. Given two distributions P = (p1, p2, ..., pm) and Q = (q1, q2, ..., qm), the KL distance is defined as

dKL(P, Q) = Σi pi log(pi / qi),

and the χ²-distance is defined as

dχ²(P, Q) = Σi (pi − qi)² / qi.
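For reference, both distance measures are straightforward to compute. The sketch below is ours (function names are not from the paper) and skips zero-probability terms by the usual convention:

```python
import math

def kl_distance(P, Q):
    """KL distance d_KL(P, Q) = sum_i p_i * log(p_i / q_i).
    Terms with p_i = 0 contribute zero by convention."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def chi2_distance(P, Q):
    """Chi-squared distance d_chi2(P, Q) = sum_i (p_i - q_i)^2 / q_i."""
    return sum((p - q) ** 2 / q for p, q in zip(P, Q) if q > 0)

# Identical distributions are at distance zero under both measures.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(kl_distance(P, P))                        # 0.0
print(kl_distance(P, Q), chi2_distance(P, Q))   # both positive
```

Either function can serve as the d(·, ·) of Problem 1 when comparing an original distribution with a reconstructed one.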

    In our evaluation, we first investigate utility vs. privacy of the randomization method in protecting attribute disclosure on two aspects: 1) compare data utility among different scenarios,and 2) the impact of cardinality upon data utility. Then, we compare randomization with l-diversity and anatomy.

      >  A. Randomization

We treated Education, Marital-status, Gender, and Race as the quasi-identifier and used Workclass as the sensitive attribute. We term this data set EMGRW. For randomization, we distort only the QI attributes, only the sensitive attribute S, or both, depending on the application scenario. Table 3 shows the derived randomization parameters pi for the three scenarios (RR-QI, RR-S, and RR-Both) with l = 2, 3, 4, 5. We can observe that more distortion (p farther from 1) is needed to achieve stronger privacy protection (larger l). We can also observe that pi for QI attributes in RR-Both is generally closer to 1 than in RR-QI, because RR-Both can also distort the sensitive attribute to achieve a given level of privacy protection; thus, a smaller magnitude of distortion is needed for the QI attributes in RR-Both.
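For intuition, applying a randomization matrix of the form in (1) to a single attribute amounts to keeping each value with probability p and otherwise replacing it with a uniformly chosen different category. A minimal illustration under these assumptions (the domain and column below are hypothetical, not the actual Adult encodings):

```python
import random

def randomize_column(values, domain, p, seed=0):
    """Randomized response for one attribute: retain each value with
    probability p; otherwise replace it with a uniformly random
    *different* category from the domain."""
    rng = random.Random(seed)
    out = []
    for v in values:
        if rng.random() < p:
            out.append(v)  # keep the original value
        else:
            out.append(rng.choice([c for c in domain if c != v]))
    return out

domain = ["HS", "BS", "MS", "PhD"]          # hypothetical 4-value domain
original = ["HS"] * 1000
distorted = randomize_column(original, domain, p=0.8)
print(sum(v == "HS" for v in distorted) / len(distorted))  # near 0.8
```

Larger l forces p away from 1, i.e., a larger fraction of values is flipped before release.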

Fig. 3 shows results on the various distance measures. Naturally, there is a tradeoff between minimizing utility loss and maximizing privacy protection. The smaller the distance values, the smaller the difference between the distorted and original databases, and the better the utility of the distorted database. We can observe that the utility loss (in terms of distance differences) increases approximately linearly with increasing privacy protection across all three randomization scenarios. RR-Both performs best in terms of utility preservation, because we use the optimal randomization parameters for both QI and the sensitive attribute.

As discussed in Section III, one advantage of the randomization scheme is that the more data we have, the more accurate the reconstruction. To investigate the impact of data size on the utility of the randomized data, we generated four more data sets of varied sizes by randomly sampling r × N tuples from the Adult data, where N is the size of the original Adult data set and r ∈ {0.5, 1.5, 2, 2.5}. All four generated data sets have the same distribution as the original one. Fig. 4 shows the accuracy of the reconstructed data distribution as data size increases. We see that data utility is further improved when more data are available.

      >  B. Comparison to Other Models

To compare the randomization scheme with l-diversity and anatomy, we chose Education, Salary, Gender, and Race as QI and Occupation as the sensitive attribute, similar to the settings of the empirical evaluations in [2, 5]. We term this data set ESGRO. We did not use the previous EMGRW data set, because l-diverse partitions cannot be derived for it by the anatomy algorithm or entropy l-diversity. As specified in Machanavajjhala et al. [2] and Xiao and Tao [5], an l-diverse partition exists if and only if at most N/l records are associated with the same sensitive value, where N is the cardinality of the data set; the distribution of attribute Workclass is highly skewed, and the frequency of one value is much larger than that of the others. In this section, we focus on RR-QI when comparing against the two group based models, l-diversity and anatomy, because the overall distribution of sensitive attribute values is unchanged both in the group based methods after anonymization and in RR-QI after randomization.

Generalization approaches usually measure utility syntactically, by the number of generalization steps applied to quasi-identifiers [42], the average size of quasi-identifier equivalence classes [2], the sum of squares of class sizes, or the preservation of marginals [3]. Since the randomization scheme is not based on generalization or suppression, we further examine utility preservation from the perspective of answering aggregate queries, in addition to the previous distribution distance measures. We adopt query answering accuracy, the same metric used in Zhang et al. [4]. We also consider the variation of correlations between the sensitive attribute and quasi-identifiers. In our experiments, we used an implementation of the Incognito algorithm [3] to generate the entropy l-diverse tables and used the anatomy algorithm in [5].

1) Distribution Distance: We compare the data utility of the reconstructed probability distributions for the different models according to the distance measures. Fig. 5 shows the distances between X̂ and π for anatomy, l-diversity, and RR-QI on the data set ESGRO. We can observe that randomization outperforms both anatomy and l-diversity, because the randomization scheme can partially recover the original data distribution, whereas the data distribution within each generalized equivalence class is lost in l-diversity generalization, and the relations between the quasi-identifier table (QIT) and the sensitive table (ST) are lost in anatomy.

Another observation is that data utility (in terms of the distance between the original and reconstructed distributions) monotonically decreases as the privacy threshold l increases for randomization and anatomy. This is expected: for randomization, more randomness must be introduced as the privacy requirement grows, and for anatomy, a larger l means that more tuples are included in each group, which decreases the accuracy of the estimate of the original data distribution. However, there is no similar monotonic trend for l-diversity, because the generalization algorithm chooses different attributes to generalize for different l values, which makes the accuracy of the estimated distribution vary to different extents.

2) Query Answering Accuracy: The accuracy of answering aggregate queries is an important aspect of evaluating the utility of distorted data. We compare randomization against the two other anonymization approaches [3, 5] using the average relative error in the returned count values. For each count value (corresponding to the number of records in a group), the relative error is defined as |act − est|/act, where act is the actual count from the original data and est is the estimate from the reconstructed data (for RR approaches) or the anonymized data (for anonymization approaches). We consider two types of queries in our evaluation.

- Base group-by query, of the form:

SELECT A1, . . . , Am, S, COUNT(*) FROM data
WHERE A1 = i1 AND . . . AND Am = im AND S = is
GROUP BY (A1, · · · , Am, S)

where ik ∈ Ωk and is ∈ Ωs.

    - Cube query with the form:

    SELECT A1,· · · , Am, S, COUNT(*) FROM data

    GROUP BY CUBE (A1, · · · , Am, S)

The base group-by query returns the number of records in each group (A1 = i1, . . . , Am = im, and S = is). Note that the total number of groups is D = ds · d1 d2 · · · dm.

After running the base group-by query on the original data and on the reconstructed or anonymized data, we obtain the count values act and est, respectively. Fig. 6a shows the calculated average relative errors of the three methods (anatomy, l-diversity, and RR-QI) with varied l (l = 1, · · · , 7). We use the CUBE query to cover all possible hierarchical aggregate queries. (In multidimensional jargon, a cube is a cross-tabulated summary of detail rows; CUBE enables a SELECT statement to calculate subtotals for all possible combinations of a group of dimensions, as well as a grand total.) The CUBE query returns aggregate values for all possible combinations of the QI attributes. We calculate the average relative error for each l = 1, · · · , 7 and show the results in Fig. 6b. We can observe from Figs. 6a and 6b that randomization permits significantly more accurate aggregate analysis than both l-diversity and anatomy, since it recovers a more accurate data distribution. In contrast, l-diversity loses the data distribution of records within each generalized QI-group, and anatomy loses the correlations between the QI attributes and the sensitive attribute S.
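The relative-error metric above can be sketched as follows. This is an illustration of the metric only, not the authors' evaluation code, and the toy records are hypothetical:

```python
from collections import Counter

def relative_error(act, est):
    """Relative error |act - est| / act of a single count value."""
    return abs(act - est) / act

def avg_relative_error(original, reconstructed):
    """Average relative error over the base group-by counts.
    Each record is a tuple (a1, ..., am, s); groups absent from the
    original data are skipped, since their actual count is zero."""
    act = Counter(original)        # counts per group in the original data
    est = Counter(reconstructed)   # counts per group in the estimate
    errs = [relative_error(act[g], est[g]) for g in act]
    return sum(errs) / len(errs)

# Toy example: two groups, true counts 40 and 60, estimates 44 and 57.
orig = [("F", "flu")] * 40 + [("M", "flu")] * 60
recon = [("F", "flu")] * 44 + [("M", "flu")] * 57
print(avg_relative_error(orig, recon))  # (0.1 + 0.05) / 2 = 0.075
```

The CUBE variant would apply the same computation to every combination of grouping attributes rather than to the full group-by only.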

3) Correlation Between Attributes: A good publishing method should also preserve data correlations (especially between QI and sensitive attributes). We use the uncertainty coefficient (U) to evaluate the correlation between two multi-category variables:

    image

The uncertainty coefficient takes values between 0 and 1; larger values indicate a stronger association between the variables. When the response variable has several possible categorizations, the measure tends to take smaller values as the number of categories increases.
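As a concrete reference, one standard asymmetric form of the uncertainty coefficient is U(Y|X) = (H(Y) − H(Y|X)) / H(Y), the fraction of the uncertainty in Y explained by X. We assume this common form for the sketch below; it may differ in detail from the paper's exact equation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of the empirical distribution of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def uncertainty_coefficient(x, y):
    """U(Y|X) = (H(Y) - H(Y|X)) / H(Y): fraction of uncertainty in Y
    explained by X. Returns 0 when Y is constant."""
    hy = entropy(y)
    if hy == 0:
        return 0.0
    n = len(x)
    # H(Y|X) = sum over x-values of Pr(X = x) * H(Y | X = x)
    hy_given_x = 0.0
    for xv, cnt in Counter(x).items():
        ys = [yv for xi, yv in zip(x, y) if xi == xv]
        hy_given_x += (cnt / n) * entropy(ys)
    return (hy - hy_given_x) / hy

# Y fully determined by X -> coefficient 1; Y independent of X -> 0.
x = ["a", "a", "b", "b"]
print(uncertainty_coefficient(x, ["p", "p", "q", "q"]))  # 1.0
print(uncertainty_coefficient(x, ["p", "q", "p", "q"]))  # 0.0
```

Applied to a QI attribute and the sensitive attribute before and after distortion, this quantifies how much of their association a publication method preserves.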

Tables 4 and 5 show how the correlation (uncertainty coefficient) values between each pair of attributes vary under the three models (anatomy, l-diversity, and RR-QI) on the data set ESGRO. We varied l from 2 to 7 but, due to space limitations, only include correlation values for l = 3, 4, and 5. We use the attribute pair Salary (S, one QI attribute) and Occupation (O, the sensitive attribute) as an example (the boldface column in Table 5) to study how correlations between QI and S change. The original uncertainty coefficient is 2.74 × 10⁻². RR-QI preserves correlation well, i.e., 2.41, 2.27, and 2.17 (× 10⁻²) for l = 3, 4, and 5, respectively. In contrast, the uncertainty coefficient under anatomy is only 0.20, 0.12, and 0.05 (× 10⁻²), correspondingly. l-diversity achieves zero correlation preservation when l = 3 and 5, while it perfectly preserves the correlation when l = 4, because the Salary attribute is generalized to “All” when l = 3 and 5 but is unchanged across all QI-groups when l = 4. In general, l-diversity cannot preserve correlation well, because it is intractable to predict which QI attributes will be generalized.

Fig. 7 shows the average value of the uncertainty coefficients among attributes for anatomy, l-diversity, and RR-Both on the data set ESGRO. As shown in Fig. 7a, randomization achieves the best correlation preservation between the sensitive attribute and the quasi-identifiers across all privacy thresholds. Fig. 7b also clearly shows that randomization preserves correlation among quasi-identifiers better than l-diversity does. Note that anatomy best preserves correlation among quasi-identifiers, since it does not change the values of the quasi-identifiers.

      >  C. Evaluation Summary

In summary, the evaluation shows that, under the same privacy requirements, randomization better preserves utility (in terms of distribution reconstruction, accuracy of aggregate query answering, and correlation among attributes); its utility loss is significantly smaller than that of the generalization and anatomy approaches. Furthermore, the randomization approach can further improve the accuracy of the reconstructed distribution (and hence utility preservation) when more data are available, while the generalization and anatomy approaches do not have this property.

    VI. CONCLUSION

In this paper, we investigated attribute disclosure under linking attacks. We compared randomization with other anonymization schemes (l-diversity and anatomy) in terms of utility preservation under the same privacy protection requirements. Our experimental evaluations showed that randomization significantly outperforms generalization, i.e., it achieves better utility preservation while yielding the same privacy protection.

There are several other avenues for future work. We aim to extend our research to handle multiple sensitive attributes. In this paper, we limited our scope to attackers who have no knowledge about the sensitive attribute of specific individuals in the population and/or the table. In practice, this may not hold, since an attacker may have instance-level background knowledge (e.g., the attacker might know that the target individual does not have cancer, or might know complete information about some people in the table other than the target individual) or partial demographic background knowledge about the distribution of sensitive and insensitive attributes in the population. Different forms of background knowledge have been studied recently in privacy preserving data publishing; for example, formal languages [22, 24] and maximum entropy constraints [43] have been proposed to express attackers' background knowledge. We will investigate privacy disclosure under such background knowledge attacks. We should note that randomization may not outperform generalization in all cases, especially when specific background knowledge is utilized in the adversary model: generalization is truthful, and the ambiguity is between true tuples (e.g., individuals a and b are indistinguishable), whereas randomization can be regarded as adding noise, and noise can be removed when some background knowledge is known.

We will continue our study of further reducing the computational overhead of the randomization approach. Randomization usually runs slower than l-diversity and anatomy; for example, the execution time of our RR-Both approach on the Adult data sets was 10 times that of the generalization approaches, due to the heavy cost of determining the optimal distortion parameters under attribute disclosure constraints, especially when the attribute domains are large. One advantage of randomization is that reconstruction accuracy increases when more data are available. Guo et al. [10] conducted a preliminary theoretical analysis of how the randomization process affects the accuracy of various measures (e.g., support, confidence, and lift) in market basket data analysis. In future work, we will study reconstruction accuracy in terms of the bias and variance of the estimates when randomizing general microdata.

    Finally, we are interested in comparing our randomization approaches to most recently developed generalization approaches(e.g., the accuracy-constrained l-diversification method in [14]).We are also interested in combining the derived optimal randomization schemes with the small domain randomization [38]to further improve utility preservation. It is our belief that extensive studies on comparing different privacy preservation data publishing approaches are crucial.

      >  Appendix 1. Proof of Property 1

We start with applying RR to the sensitive attribute only, i.e., Pr(Q̂Ir = QIr) = 1. Without loss of generality, we assume that Sr = 1. We combine the other categories 0, 2, . . . , ds − 1 into a new category, which we still denote by 0. To simplify the notation, we write

    image

as π0 in this proof; then π1 = 1 − π0. This adjustment does not change the reconstruction probability Pr(Ŝr = Sr = 1 | QIr). After the adjustment, the randomization probabilities are given by

    image

    By definition, the posterior probabilities are given by

    image
    image

    Combining (13), (14), and (15), we have

    image

Taking the derivative with respect to ps, we find that (16) is minimized when ps = 1/ds, and the minimal value is

    image

    Following similar strategies, we can prove the general case when we randomize both QI and S.
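The derivative argument can also be checked numerically: compute the attacker's posterior Pr(Sr = s | Ŝr = t) by Bayes' rule for the uniform-flip randomization matrix and scan over ps; the worst-case posterior is smallest near ps = 1/ds. A sketch under these assumptions (the prior π below is arbitrary):

```python
import numpy as np

def max_posterior(pi, ps):
    """Worst-case posterior Pr(S = s | S_hat = t) over all (s, t),
    for the matrix that keeps a sensitive value with probability ps
    and otherwise flips it uniformly to one of the other d - 1 values."""
    d = len(pi)
    q = (1 - ps) / (d - 1)
    # lik[t, s] = Pr(S_hat = t | S = s): ps on the diagonal, q elsewhere
    lik = np.full((d, d), q) + (ps - q) * np.eye(d)
    joint = lik * pi                          # joint[t, s] = Pr(S_hat=t, S=s)
    post = joint / joint.sum(axis=1, keepdims=True)  # Bayes: normalize per t
    return post.max()

pi = np.array([0.5, 0.3, 0.2])   # arbitrary prior of the sensitive attribute
grid = np.linspace(0.05, 0.95, 91)
best = min(grid, key=lambda p: max_posterior(pi, p))
print(best)   # grid point closest to 1/3 = 1/d_s
```

At ps = 1/ds the released value carries no information, so the posterior collapses to the prior, matching the minimum found in the proof.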

      >  Appendix 2. Proof of Result 2

LEMMA 3. If d(π̂, π) = ‖π̂ − π‖₂², then E[d(π̂, π)] = trace(Σ), where Σ is the covariance matrix of π̂ shown in (2).

PROOF. We know that π̂ asymptotically follows the normal distribution N(π, Σ). Let Σ = XΛXᵀ be the eigen-decomposition of Σ, where Λ = diag(λ1, . . . , λn) and XᵀX = I. Let η = Xᵀ(π̂ − π); then η is normally distributed with E(η) = 0 and

Cov(η) = Xᵀ Cov(π̂ − π) X = XᵀΣX = Λ.

Notice that Λ is a diagonal matrix; hence Cov(ηi, ηj) = 0 for i ≠ j and Var(ηi) = λi, i.e., each ηi independently follows the normal distribution N(0, λi). Therefore, we have:

    image

    The last equality is due to the fact that

trace(Σ) = trace(XΛXᵀ) = trace(ΛXᵀX) = trace(Λ).

    We proved the lemma.
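Lemma 3 can be sanity-checked by simulation: drawing π̂ from N(π, Σ) and averaging ‖π̂ − π‖₂² should recover trace(Σ). A sketch with an arbitrary positive semi-definite covariance matrix (not the specific Σ of Equation 2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary positive semi-definite covariance matrix Sigma.
A = rng.standard_normal((4, 4))
Sigma = A @ A.T / 100.0
pi = np.array([0.4, 0.3, 0.2, 0.1])

# Draw many estimates pi_hat ~ N(pi, Sigma) and average ||pi_hat - pi||^2.
samples = rng.multivariate_normal(pi, Sigma, size=200_000)
mean_sq_dist = ((samples - pi) ** 2).sum(axis=1).mean()

print(mean_sq_dist, np.trace(Sigma))  # the two values should nearly agree
```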

LEMMA 4. Let Pi be the randomization matrix specified in (1). When pi ≠ 1/di, we have

    image

    where 1 is the column vector, whose cells all equal 1.

    PROOF. We can re-write Pi as follows:

    image

    With

    image

and the binomial inverse theorem [45], we can immediately obtain Pi⁻¹, and ‖Pi⁻¹‖F² can be directly calculated from Pi⁻¹.
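Concretely, writing q = (1 − pi)/(di − 1), the matrix in (1) is Pi = (pi − q)I + q·11ᵀ, and since its column sums equal 1, the binomial inverse (Sherman-Morrison) formula reduces to Pi⁻¹ = (I − q·11ᵀ)/(pi − q). A quick numerical check of this closed form:

```python
import numpy as np

d, p = 5, 0.7
q = (1 - p) / (d - 1)

# Randomization matrix of the form in (1): p on the diagonal,
# (1 - p)/(d - 1) everywhere else.
P = (p - q) * np.eye(d) + q * np.ones((d, d))

# Closed-form inverse from the binomial inverse theorem,
# valid when p != 1/d (so that p - q != 0).
P_inv = (np.eye(d) - q * np.ones((d, d))) / (p - q)

print(np.allclose(P @ P_inv, np.eye(d)))   # True: it is the inverse
print(np.linalg.norm(P_inv, "fro") ** 2)   # the quantity ||P_i^{-1}||_F^2
```

The Frobenius norm of this inverse is exactly the per-attribute factor that Result 2 minimizes.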

LEMMA 5. Let (P⁻¹)i denote the i-th column (or row) of P⁻¹. Then, for any i = 1, 2, . . . , D, we have

    image

PROOF. By Lemma 4, every row (or column) of Pi⁻¹ has the same components, up to a change of order. Since P⁻¹ = P1⁻¹ ⊗ · · · ⊗ Ps⁻¹, all rows (or columns) of P⁻¹ likewise have the same components, up to a change of order. Therefore, for all i,

    image

Next, we prove the main result. By Lemma 3, when the distance is the squared Euclidean distance, minimizing E[d(π̂, π)] is equivalent to minimizing trace(Σ). With (2), we have

    image

Notice that the constraint function is an increasing function of pi; hence the optimal solution must occur when equality holds, and the result is proved.

  • 1. Samarati P, Sweeney L 1998 Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression [Proceedings of the IEEE Symposium on Research in Security and Privacy] google
  • 2. Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M 2006 L-diversity: privacy beyond k-anonymity [Proceedings of the 22nd International Conference on Data Engineering] google
  • 3. LeFevre K, DeWitt D. J, Ramakrishnan R 2005 Incognito: efficient full-domain K-anonymity [ACM SIGMOD International Conference on Management of Data] P.49-60 google
  • 4. Zhang Q, Koudas N, Srivastava D, Yu T 2007 Aggregate query answering on anonymized tables [Proceedings of the 23rd International Conference on Data Engineering] P.116-125 google
  • 5. Xiao X, Tao Y 2006 Anatomy: simple and effective privacy preservation [Proceedings of the 32nd International Conference on Very Large Data Bases] P.139-150 google
  • 6. Lambert D 1993 Measures of disclosure risk and harm [Journal of Official Statistics] Vol.9 P.313-331 google
  • 7. Li N, Li T, Venkatasubramanian S 2007 T-closeness: privacy beyond k-anonymity and l-diversity [Proceedings of the 23rd International Conference on Data Engineering] P.106-115 google
  • 8. Du W, Zhan Z 2003 Using randomized response techniques for privacy-preserving data mining [Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining] P.505-510 google
  • 9. Rizvi S. J, Haritsa J. R 2002 Maintaining data privacy in association rule mining [Proceedings of the 28th International Conference on Very Large Data Bases] P.682-693 google
  • 10. Guo L, Guo S, Wu X 2007 Privacy preserving market basket data analysis [The 11th European Conference on Principles and Practice of Knowledge Discovery in Databases] P.103-114 google
  • 11. Guo L, Wu X 2009 Privacy preserving categorical data analysis with unknown distortion parameters [Transactions on Data Privacy] Vol.2 P.185-205 google
  • 12. Teng Z, Du W 2006 Comparisons of k-anonymization and randomization schemes under linking attacks [Proceedings of the 6th International Conference on Data Mining] P.1091-1096 google
  • 13. Aggarwal G, Feder T, Kenthapadi K, Motwani R, Panigrahy R, Thomas D, Zhu A 2005 Approximation algorithms for kanonymity [Journal of Privacy Technology] google
  • 14. Ghinita G, Karras P, Kalnis P, Mamoulis N 2009 A framework for efficient data anonymization under privacy and accuracy constraints [ACM Transactions on Database Systems] Vol.34 P.9:1-9:47 google
  • 15. Kifer D, Gehrke J 2006 Injecting utility into anonymized datasets [ACM SIGMOD International Conference on Management of Data] P.217-228 google
  • 16. Koudas N, Srivastava D, Yu T, Zhang Q 2009 Distribution-based microdata anonymization [Proceedings of the 35th International Conference on Very Large Data Bases] google
  • 17. Narayanan A, Shmatikov V 2008 Robust de-anonymization of large sparse datasets [IEEE Symposium on Security and Privacy] P.111-125 google
  • 18. Brickell J, Shmatikov V 2008 The cost of privacy: destruction of data-mining utility in anonymized data publishing [The 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining] P.70-78 google
  • 19. Li T, Li N 2009 On the tradeoff between privacy and utility in data publishing [The 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining] P.517-525 google
  • 20. Truta T. M, Vinay B 2006 Privacy protection: p-sensitive k-anonymity property [Proceedings of the 22nd IEEE Internationl Conference on Data Engineering] P.94 google
  • 21. Wong R. C. W, Li J, Fu A. W. C, Wang K 2006 (α, k)-anonymity: an enhanced k-anonymity model for privacy-preserving data publishing [The 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining] P.754-759 google
  • 22. Martin D. J, Kifer D, Machanavajjhala A, Gehrke J, Halpern J. Y 2007 Worst-case background knowledge for privacy-preserving data publishing [Proceedings of the 23rd International Conference on Data Engineering] P.126-135 google
  • 23. Wong R. C. W, Fu A. W. C, Wang K, Pei J 2007 Minimality attack in privacy preserving data publishing [Proceedings of the 33rd International Conference on Very Large Data Bases] P.543-554 google
  • 24. Chen B, LeFevre K, Ramakrishnan R 2007 Privacy skyline: privacy with multidimensional adversarial knowledge [Proceedings of the 33rd International Conference on Very Large Data Bases] P.770-781 google
  • 25. Nergiz M. E, Atzori M, Clifton C 2007 Hiding the presence of individuals from shared databases [ACM SIGMOD International Conference on Management of Data] P.665-676 google
  • 26. Li J, Tao Y, Xiao K 2008 Preservation of proximity privacy in publishing numerical sensitive data [ACM SIGMOD International Conference on Management of Data] P.473-485 google
  • 27. Wei Q, Lu Y, Lou Q 2008 (t, λ)-uniqueness: anonymity management for data publication [The 7th IEEE/ACIS International Conference on Computer and Information Science] P.107-112 google doi
  • 28. Warner S. L 1965 Randomized response: a survey technique for eliminating evasive answer bias [Journal of the American Statistical Association] Vol.60 P.63-66 google doi
  • 29. Agrawal R, Srikant R 2000 Privacy-preserving data mining [ACM SIGMOD International Conference on Management of Data] P.439-450 google
  • 30. Aggarwal C. C, Yu P. S 2008 A survey of randomization methods for privacy-preserving data mining [Privacy-Preserving Data Mining: Models and Algorithms] P.137-156 google
  • 31. Agrawal S, Haritsa J. R 2005 A framework for high-accuracy privacy-preserving mining [Proceedings of the 21st International Conference on Data Engineering] P.193-204 google
  • 32. Aggarwal C. C 2008 On unifying privacy and uncertain data models [Proceedings of the 24th International Conference on Data Engineering] P.386-395 google
  • 33. Zhu Y, Liu L 2004 Optimal randomization for privacy preserving data mining [The 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining] P.761-766 google
  • 34. Rebollo-Monedero D, Forne J, Domingo-Ferrer J 2010 From t-closeness-like privacy to postrandomization via information theory [IEEE Transactions on Knowledge and Data Engineering] Vol.22 P.1623-1636 google doi
  • 35. Gouweleeuw J. M, Kooiman P, Willenborg L. C. R. J, de Wolf P. P 1998 Post randomisation for statistical disclosure control:theory and implementation [Journal of Official Statistics] Vol.14 P.463-478 google
  • 36. Huang Z, Du W 2008 OptRR: optimizing randomized response schemes for privacy-preserving data mining [Proceedings of the 24th International Conference on Data Engineering] P.705-714 google
  • 37. Xiao X, Tao Y, Chen M 2009 Optimal random perturbation at multiple privacy levels [Proceedings of the 35th International Conference on Very Large Data Bases] P.814-825 google
  • 38. Chaytor R, Wang K 2010 Small domain randomization: same privacy more utility [Proceedings of the 36th International Conference on Very Large Data Bases] P.608-618 google
  • 39. Chaudhuri A, Mukerjee R Randomized Response: Theory and Techniques google
  • 40. Coleman T. F, Liu J, Yuan W 2002 A new trust-region algorithm for equality constrained optimization [Computational Optimization and Applications] Vol.21 P.177-199 google doi
  • 41. Asuncion A, Newman D. J UCI machine learning repository google
  • 42. Samarati P 2001 Protecting respondents’ identities in microdata release [IEEE Transactions on Knowledge and Data Engineering] Vol.13 P.1010-1027 google doi
  • 43. Du W, Teng Z, Zhu Z 2008 Privacy-MaxEnt: integrating background knowledge in privacy quantification [ACM SIGMOD International Conference on Management of Data] P.459-472 google
  • 44. Guo L, Ying X, Wu X 2010 On attribute disclosure in randomization based privacy preserving data publishing [Proceedings of the 10th IEEE International Conference on Data Mining Workshops] P.466-473 google doi
  • 45. Strang G 2003 Introduction to Linear Algebra google
  • [Table 1.] 2 × 3 contingency table for two variables: gender (QI) and disease (sensitive)
  • [Fig. 1.] Randomization based privacy-preserving data publishing.
  • [Fig. 2.] Pr(Sr|QIr) vs. randomization parameters. QI: quasi-identifier.
  • [Table 2.] Description of the Adult data set used in the evaluation
  • [Table 3.] Randomization parameters pi for three cases of RR (data set EMGRW)
  • [Fig. 3.] Distances between π̂ and π for three scenarios of RR (data set EMGRW). RR: randomized response, QI: quasi-identifier.
  • [Fig. 4.] For RR-Both, distances between π̂ and π decrease as the data set size increases (data set EMGRW). RR: randomized response, QI: quasi-identifier.
  • [Fig. 5.] Distances between π̂ and π for anatomy, l-diversity, and RR-QI (randomized response-quasi-identifier; data set ESGRO).
  • [Fig. 6.] Relative errors of queries for anatomy, l-diversity, and RR-QI (randomized response-quasi-identifier; data set ESGRO).
  • [Table 4.] Variation of correlation (uncertainty coefficient) between pairs of quasi-identifier (QI) attributes under different models (× 10⁻²) (data set ESGRO)
  • [Table 5.] Variation of correlation (uncertainty coefficient) between quasi-identifier (QI) and S under different models (× 10⁻²) (data set ESGRO)
  • [Fig. 7.] Average value of uncertainty coefficients among attributes for anatomy, l-diversity, and RR-QI (randomized response-quasi-identifier; data set ESGRO).