k-anonymity
In Brief
k-anonymity (like the whole family of anonymity-by-indistinguishability models) is based on comparisons among the individuals present in the data: it aims to make each individual so similar to others as to be indistinguishable from at least k-1 of them.
More in Detail
One of the most common models for anonymization is k-anonymity, under which a subject is considered anonymous if an adversary cannot single her out from a set of at least k-1 other subjects. This set of k individuals is called the anonymity set. Identification, in this context, means the ability to distinguish the subject from the other individuals with whom her data are grouped. This definition of anonymity is implicitly quantifiable, since it depends on the size of the anonymity set: a bigger anonymity set implies a stronger indistinguishability guarantee. For example, a subject with an anonymity set of size 30 enjoys better anonymity than one with an anonymity set of size 5.
The k-anonymity framework was originally applied to relational tables [1]. The basic assumption is that attributes are partitioned into quasi-identifiers and sensitive attributes [2]. The quasi-identifiers are attributes that can be linked to external information to re-identify the individual to whom the information refers (a so-called re-identification attack). They are usually publicly known or easy to obtain with superficial knowledge of the subject: age, ZIP code, and gender are classic quasi-identifiers. The sensitive attributes, instead, represent the information to be protected. A dataset satisfies the property of k-anonymity if, for each released record, at least k-1 other records in the release have identical values over the quasi-identifiers.
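To make the definition concrete, here is a minimal Python sketch that computes the k a release actually guarantees, namely the size of its smallest group of records agreeing on the quasi-identifiers. The toy dataset and attribute names are hypothetical, chosen only for illustration:

```python
from collections import Counter

def k_anonymity_level(records, quasi_identifiers):
    """Return the size of the smallest group of records sharing the
    same quasi-identifier values, i.e. the k the release guarantees."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical toy data: every (age, zip) combination appears at
# least twice, so this release is 2-anonymous.
records = [
    {"age": "30-39", "zip": "021**", "disease": "Flu"},
    {"age": "30-39", "zip": "021**", "disease": "Asthma"},
    {"age": "40-49", "zip": "100**", "disease": "Diabetes"},
    {"age": "40-49", "zip": "100**", "disease": "Flu"},
]
print(k_anonymity_level(records, ["age", "zip"]))  # -> 2
```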
In k-anonymity techniques, strategies such as generalization and suppression are usually applied to reduce the granularity of representation of the quasi-identifiers. These methods guarantee privacy, but they also reduce the accuracy of applications run on the transformed data. Indeed, one of the main challenges of k-anonymity is to find the minimum amount of change (in terms of generalization or suppression) that guarantees strong privacy while preserving good data precision. Table 2 shows an example of protection through generalization that achieves 3-anonymity, i.e., k-anonymity where each anonymity set has size at least 3, starting from the situation depicted in Table 1 (a code sketch of such a generalization follows the tables below).
Table 1: the original data.

| Pseudo-id | Gender | Date of Birth | ZIP code | Disease |
|---|---|---|---|---|
| 1000 | F | 5 June 1975 | 02108 | Lyme Disease |
| 1001 | F | 15 May 1973 | 01970 | Hypertension |
| 1002 | F | 3 December 1977 | 02657 | Vertigo |
| 1003 | M | 5 September 1941 | 10238 | Stroke |
| 1004 | M | 15 June 1947 | 10042 | Stroke |
| 1005 | M | 25 April 1946 | 10133 | Stroke |
| 1006 | F | 25 December 1942 | 10053 | Diabetes |
| 1007 | F | 5 October 1949 | 10053 | Osteoporosis |
| 1008 | F | 6 July 1946 | 10053 | Arthritis |
Table 2: the 3-anonymous release obtained through generalization.

| Pseudo-id | Gender | Date of Birth | ZIP code | Disease |
|---|---|---|---|---|
| 1000 | F | [1970-1979] | 0**** | Lyme Disease |
| 1001 | F | [1970-1979] | 0**** | Hypertension |
| 1002 | F | [1970-1979] | 0**** | Vertigo |
| 1003 | M | [1940-1949] | 10*** | Stroke |
| 1004 | M | [1940-1949] | 10*** | Stroke |
| 1005 | M | [1940-1949] | 10*** | Stroke |
| 1006 | F | [1940-1949] | 10053 | Diabetes |
| 1007 | F | [1940-1949] | 10053 | Osteoporosis |
| 1008 | F | [1940-1949] | 10053 | Arthritis |
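The step from Table 1 to Table 2 can be sketched programmatically. For simplicity, the Python snippet below applies one uniform generalization (decade of birth, first ZIP digit) to every record; this is coarser than Table 2, which keeps more ZIP digits where the groups allow it, and real anonymization algorithms would search for the minimal generalization reaching the target k. The helper names are illustrative:

```python
from collections import Counter

def generalize(record, zip_digits):
    """One possible generalization step: date of birth -> decade,
    ZIP code -> keep only the first `zip_digits` digits."""
    year = int(record["dob"].split()[-1])
    lo = year - year % 10
    return {
        "gender": record["gender"],
        "dob": f"[{lo}-{lo + 9}]",
        "zip": record["zip"][:zip_digits] + "*" * (5 - zip_digits),
    }

table1 = [
    {"gender": "F", "dob": "5 June 1975",      "zip": "02108"},
    {"gender": "F", "dob": "15 May 1973",      "zip": "01970"},
    {"gender": "F", "dob": "3 December 1977",  "zip": "02657"},
    {"gender": "M", "dob": "5 September 1941", "zip": "10238"},
    {"gender": "M", "dob": "15 June 1947",     "zip": "10042"},
    {"gender": "M", "dob": "25 April 1946",    "zip": "10133"},
    {"gender": "F", "dob": "25 December 1942", "zip": "10053"},
    {"gender": "F", "dob": "5 October 1949",   "zip": "10053"},
    {"gender": "F", "dob": "6 July 1946",      "zip": "10053"},
]

# Masking all but the first ZIP digit makes every quasi-identifier
# combination appear at least three times: the release is 3-anonymous.
released = [generalize(r, zip_digits=1) for r in table1]
groups = Counter(tuple(r.values()) for r in released)
print(min(groups.values()))  # -> 3
```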
Weaknesses of the k-anonymity model
Unfortunately, the k-anonymity framework can be vulnerable in some cases. In particular, it is not safe against the homogeneity attack and the background knowledge attack. The homogeneity attack exploits a possible lack of variety in the sensitive attributes: an adversary can infer the value of the sensitive attribute whenever a k-anonymous dataset contains a group of k entries sharing the same value for it. As an example, suppose that an attacker is searching for Bob in the dataset presented in Table 2 and knows that Bob was born in July 1946. Even though the attacker cannot distinguish Bob among patients 1003, 1004, and 1005, the disease associated with all three of these patients is a stroke; thus, the adversary cannot re-identify Bob, but can still infer that Bob suffered a stroke. In a background knowledge attack, instead, the attacker knows information that links some quasi-identifiers to some sensitive attributes: staying in the medical domain, for example, it is common knowledge that certain diseases are more frequent in a specific gender.
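Detecting exposure to the homogeneity attack is itself mechanical: it suffices to flag the quasi-identifier groups whose sensitive attribute takes a single value (the intuition later formalized by the l-diversity model). A small Python sketch over the two generalized groups of Table 2, with illustrative names, follows:

```python
from collections import defaultdict

def homogeneous_groups(records, quasi_identifiers, sensitive):
    """Return the quasi-identifier groups whose sensitive attribute
    takes exactly one value: k-anonymous, yet open to inference."""
    values = defaultdict(set)
    for r in records:
        key = tuple(r[a] for a in quasi_identifiers)
        values[key].add(r[sensitive])
    return [key for key, vals in values.items() if len(vals) == 1]

# Two of the groups from Table 2: the female group is varied, but the
# male group is homogeneous (all "Stroke"), so the attack succeeds.
table2 = [
    {"gender": "F", "dob": "[1970-1979]", "zip": "0****", "disease": "Lyme Disease"},
    {"gender": "F", "dob": "[1970-1979]", "zip": "0****", "disease": "Hypertension"},
    {"gender": "F", "dob": "[1970-1979]", "zip": "0****", "disease": "Vertigo"},
    {"gender": "M", "dob": "[1940-1949]", "zip": "10***", "disease": "Stroke"},
    {"gender": "M", "dob": "[1940-1949]", "zip": "10***", "disease": "Stroke"},
    {"gender": "M", "dob": "[1940-1949]", "zip": "10***", "disease": "Stroke"},
]
print(homogeneous_groups(table2, ["gender", "dob", "zip"], "disease"))
# -> [('M', '[1940-1949]', '10***')]
```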
Bibliography
- [1] Latanya Sweeney. Uniqueness of simple demographics in the U.S. population. LIDAP-WP4, 2000.
- [2] Latanya Sweeney. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, October 2002. doi:10.1142/S0218488502001648.
This entry was readapted by Francesca Pratesi, Roberto Pellungrini, and Anna Monreale from Comandé et al., Elgar Encyclopedia of Law and Data Science, Edward Elgar Publishing (2022), ISBN 978 1 83910 458 9.