k-anonymity
In Brief
k-anonymity (like the whole family of anonymity-by-indistinguishability models) is based on comparisons among the individuals present in the data: it aims to make each individual so similar to others as to be indistinguishable from at least k-1 of them.
More in Detail
One of the most common models for anonymization is k-anonymity, under which a subject is considered anonymous if an adversary cannot single her out from a set of at least k-1 other subjects. This set of k individuals is called the anonymity set. Identification, in this context, means the ability to distinguish the subject from the other individuals with whom her data are grouped. This definition of anonymity is implicitly quantifiable, since it depends on the size of the anonymity set: a bigger anonymity set implies a stronger indistinguishability guarantee. For example, a subject with an anonymity set of size 30 enjoys better anonymity than one with an anonymity set of size 5.
The k-anonymity framework was originally applied to relational tables [1]. The basic assumption is that attributes are partitioned into quasi-identifiers and sensitive attributes [2]. The quasi-identifiers are attributes that can be linked to external information to re-identify the individual to whom the information refers (a so-called re-identification attack). They are usually publicly known or easy to obtain with superficial knowledge of the subject: age, ZIP code, and gender are classic quasi-identifiers. The sensitive attributes, instead, represent the information to be protected. A dataset satisfies the property of k-anonymity if, for each released record, at least k-1 other records in the release have identical values over the quasi-identifiers.
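To make the definition concrete, here is a minimal Python sketch that computes the k a release actually guarantees, namely the size of its smallest group of records agreeing on the quasi-identifiers. The toy dataset and attribute names are hypothetical, chosen only for illustration:

```python
from collections import Counter

def k_anonymity_level(records, quasi_identifiers):
    """Return the size of the smallest group of records sharing the
    same quasi-identifier values, i.e. the k the release guarantees."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical toy data: every (age, zip) combination appears at
# least twice, so this release is 2-anonymous.
records = [
    {"age": "30-39", "zip": "021**", "disease": "Flu"},
    {"age": "30-39", "zip": "021**", "disease": "Asthma"},
    {"age": "40-49", "zip": "100**", "disease": "Diabetes"},
    {"age": "40-49", "zip": "100**", "disease": "Flu"},
]
print(k_anonymity_level(records, ["age", "zip"]))  # -> 2
```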
In k-anonymity techniques, strategies such as generalization and suppression are usually applied to reduce the granularity of representation of the quasi-identifiers. These methods guarantee privacy, but they also reduce the accuracy of applications run on the transformed data. Indeed, one of the main challenges of k-anonymity is to find the minimum amount of change (in terms of generalization or suppression) that guarantees strong privacy while preserving good data precision. Table 2 shows an example of protection through generalization that achieves 3-anonymity, i.e., k-anonymity where each anonymity set has size at least 3, starting from the situation depicted in Table 1 (a code sketch of such a generalization follows the tables below).
Table 1: the original data.

| Pseudo-id | Gender | Date of Birth | ZIP code | Disease |
|---|---|---|---|---|
| 1000 | F | 5 June 1975 | 02108 | Lyme Disease |
| 1001 | F | 15 May 1973 | 01970 | Hypertension |
| 1002 | F | 3 December 1977 | 02657 | Vertigo |
| 1003 | M | 5 September 1941 | 10238 | Stroke |
| 1004 | M | 15 June 1947 | 10042 | Stroke |
| 1005 | M | 25 April 1946 | 10133 | Stroke |
| 1006 | F | 25 December 1942 | 10053 | Diabetes |
| 1007 | F | 5 October 1949 | 10053 | Osteoporosis |
| 1008 | F | 6 July 1946 | 10053 | Arthritis |
Table 2: the 3-anonymous release obtained through generalization.

| Pseudo-id | Gender | Date of Birth | ZIP code | Disease |
|---|---|---|---|---|
| 1000 | F | [1970-1979] | 0**** | Lyme Disease |
| 1001 | F | [1970-1979] | 0**** | Hypertension |
| 1002 | F | [1970-1979] | 0**** | Vertigo |
| 1003 | M | [1940-1949] | 10*** | Stroke |
| 1004 | M | [1940-1949] | 10*** | Stroke |
| 1005 | M | [1940-1949] | 10*** | Stroke |
| 1006 | F | [1940-1949] | 10053 | Diabetes |
| 1007 | F | [1940-1949] | 10053 | Osteoporosis |
| 1008 | F | [1940-1949] | 10053 | Arthritis |
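The step from Table 1 to Table 2 can be sketched programmatically. For simplicity, the Python snippet below applies one uniform generalization (decade of birth, first ZIP digit) to every record; this is coarser than Table 2, which keeps more ZIP digits where the groups allow it, and real anonymization algorithms would search for the minimal generalization reaching the target k. The helper names are illustrative:

```python
from collections import Counter

def generalize(record, zip_digits):
    """One possible generalization step: date of birth -> decade,
    ZIP code -> keep only the first `zip_digits` digits."""
    year = int(record["dob"].split()[-1])
    lo = year - year % 10
    return {
        "gender": record["gender"],
        "dob": f"[{lo}-{lo + 9}]",
        "zip": record["zip"][:zip_digits] + "*" * (5 - zip_digits),
    }

table1 = [
    {"gender": "F", "dob": "5 June 1975",      "zip": "02108"},
    {"gender": "F", "dob": "15 May 1973",      "zip": "01970"},
    {"gender": "F", "dob": "3 December 1977",  "zip": "02657"},
    {"gender": "M", "dob": "5 September 1941", "zip": "10238"},
    {"gender": "M", "dob": "15 June 1947",     "zip": "10042"},
    {"gender": "M", "dob": "25 April 1946",    "zip": "10133"},
    {"gender": "F", "dob": "25 December 1942", "zip": "10053"},
    {"gender": "F", "dob": "5 October 1949",   "zip": "10053"},
    {"gender": "F", "dob": "6 July 1946",      "zip": "10053"},
]

# Masking all but the first ZIP digit makes every quasi-identifier
# combination appear at least three times: the release is 3-anonymous.
released = [generalize(r, zip_digits=1) for r in table1]
groups = Counter(tuple(r.values()) for r in released)
print(min(groups.values()))  # -> 3
```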
Weaknesses of the k-anonymity model
Unfortunately, the k-anonymity framework can be vulnerable in some cases. In particular, it is not safe against the homogeneity attack and the background knowledge attack. The homogeneity attack exploits a possible lack of variety in the sensitive attributes: an adversary can infer the value of the sensitive attribute whenever a k-anonymous dataset contains a group of k entries sharing the same value for it. As an example, suppose that an attacker is searching for Bob in the dataset presented in Table 2 and knows that Bob was born in July 1946. Even though the attacker cannot distinguish Bob among patients 1003, 1004, and 1005, the disease associated with all three of these patients is a stroke; thus, the adversary cannot re-identify Bob, but can still infer that Bob suffered a stroke. In a background knowledge attack, instead, the attacker knows information that links some quasi-identifiers to some sensitive attributes: staying in the medical domain, for example, it is common knowledge that certain diseases are more frequent in a specific gender.
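Detecting exposure to the homogeneity attack is itself mechanical: it suffices to flag the quasi-identifier groups whose sensitive attribute takes a single value (the intuition later formalized by the l-diversity model). A small Python sketch over the two generalized groups of Table 2, with illustrative names, follows:

```python
from collections import defaultdict

def homogeneous_groups(records, quasi_identifiers, sensitive):
    """Return the quasi-identifier groups whose sensitive attribute
    takes exactly one value: k-anonymous, yet open to inference."""
    values = defaultdict(set)
    for r in records:
        key = tuple(r[a] for a in quasi_identifiers)
        values[key].add(r[sensitive])
    return [key for key, vals in values.items() if len(vals) == 1]

# Two of the groups from Table 2: the female group is varied, but the
# male group is homogeneous (all "Stroke"), so the attack succeeds.
table2 = [
    {"gender": "F", "dob": "[1970-1979]", "zip": "0****", "disease": "Lyme Disease"},
    {"gender": "F", "dob": "[1970-1979]", "zip": "0****", "disease": "Hypertension"},
    {"gender": "F", "dob": "[1970-1979]", "zip": "0****", "disease": "Vertigo"},
    {"gender": "M", "dob": "[1940-1949]", "zip": "10***", "disease": "Stroke"},
    {"gender": "M", "dob": "[1940-1949]", "zip": "10***", "disease": "Stroke"},
    {"gender": "M", "dob": "[1940-1949]", "zip": "10***", "disease": "Stroke"},
]
print(homogeneous_groups(table2, ["gender", "dob", "zip"], "disease"))
# -> [('M', '[1940-1949]', '10***')]
```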
Bibliography
- [1] Latanya Sweeney. Uniqueness of simple demographics in the U.S. population. LIDAP-WP4, 2000.
- [2] Latanya Sweeney. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, October 2002. doi:10.1142/S0218488502001648.
This entry was readapted by Francesca Pratesi, Roberto Pellungrini, and Anna Monreale from Comandé et al., Elgar Encyclopedia of Law and Data Science, Edward Elgar Publishing (2022), ISBN 978 1 83910 458 9.