t-closeness
Contents
t-closeness#
In Brief#
t-closeness aims to maintain the distribution of sensitive attributes in the anonymity by indistinguishability paradigm, ensuring that the distance between the two distributions (the original and the private ones) should be limited by a threshold t.
More in Detail#
t-closeness [1] was proposed to overcome one of the weakness of k-anonymity against the backgound knowledge attack. In a background knowledge attack, an attacker knows information useful to associate some quasi-identifiers with some sensitive attributes: as an example, in the medical domain, it is common knowledge that certain diseases are more frequent in a specific gender. Or, again, it is a well-known information that Asian people are less suscettible to heart diseases.
Indeed, suppose to publish the 3-anonymous dataset depicted in Table 4, and suppose that an attacker is searching for Umeko, a 20 years old Japanese, living in the 300112 area. The anonymity set containing Umeko is also l-diverse (see l-diversity. However, it is well-known that Japanese have an extremely low incidence of heart disease. Thus, the adversary can exploit this statistical information to infer, with a good probability, that Umeko suffers from a viral infection.
Pseudo-id |
Gender |
Date of Birth |
ZIP code |
Disease |
---|---|---|---|---|
1000 |
M |
[2000-2010] |
30011* |
Lyme Disease |
1001 |
M |
[2000-2010] |
30011* |
Viral Infection |
1002 |
M |
[2000-2010] |
30011* |
Stroke |
1003 |
F |
[2000-2010] |
30011* |
Stroke |
1004 |
F |
[2000-2010] |
30011* |
Stroke |
1005 |
F |
[2000-2010] |
30011* |
Stroke |
1006 |
F |
[2000-2010] |
30011* |
Viral Infection |
In order to protect people in a dataset with the t-closeness mechanism, indeed, we need to guarantee that the distribution of a sensitive attribute in any equivalence class must be close to the distribution of the attribute in the overall dataset (i.e., the distance between the two distributions should be no more than a threshold t)