Re-identification Attack

Synonyms: Linking attack, Attack on pseudonymised data

In Brief

A re-identification attack aims to link data about an individual, held in a dataset that contains no direct identifiers, to a real identity by relying on additional information.

More in Detail

A first, basic way to preserve a subject’s privacy is to decouple the subject’s identity from their data. This process is called pseudonymisation. The typical practical approach is to detect which attributes in the data may reveal the subject’s identity, called personal identifiers, and substitute them with some other value. While this process reduces the risk of directly re-identifying data subjects from the published data, re-identification can still be possible in certain cases.
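The substitution step can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function name `pseudonymise`, the choice of `surname` and `name` as the direct identifiers, the use of random hexadecimal tokens as replacement values, and the invented records.

```python
import secrets

# Hypothetical choice of direct identifiers; a real dataset needs its own analysis.
DIRECT_IDENTIFIERS = ("surname", "name")


def pseudonymise(records, identifiers=DIRECT_IDENTIFIERS):
    """Replace direct identifiers with a random pseudonym.

    Returns the released records and a lookup table (pseudonym -> identifiers)
    that has to be stored separately from the released data.
    """
    pseudonym_of = {}  # identifier values -> pseudonym (same person, same pseudonym)
    lookup = {}        # pseudonym -> identifier values
    released = []
    for record in records:
        key = tuple(record[column] for column in identifiers)
        if key not in pseudonym_of:
            pseudonym_of[key] = secrets.token_hex(8)
            lookup[pseudonym_of[key]] = dict(zip(identifiers, key))
        row = {"pseudonym": pseudonym_of[key]}
        # Every attribute that is not a direct identifier is copied unchanged.
        row.update((k, v) for k, v in record.items() if k not in identifiers)
        released.append(row)
    return released, lookup


# Invented records, for illustration only.
records = [
    {"surname": "Doe", "name": "Jane", "date_of_birth": "1950-01-01",
     "zip_code": "02139", "diagnosis": "Asthma"},
    {"surname": "Roe", "name": "Richard", "date_of_birth": "1960-02-02",
     "zip_code": "02138", "diagnosis": "Influenza"},
    {"surname": "Doe", "name": "Jane", "date_of_birth": "1950-01-01",
     "zip_code": "02139", "diagnosis": "Bronchitis"},
]

released, lookup = pseudonymise(records)
```

In this sketch, attributes such as the date of birth and the ZIP code pass through unchanged, which is why re-identification can remain possible after the direct identifiers are gone.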

Additional information (usually called background information) can be used by a malicious third party (often called the adversary or attacker) to link a subject’s identity to their data. Whether such linkage is possible marks the difference between pseudonymised and anonymised data. Anonymous data are stripped of all distinctive elements of the person, i.e., those elements that allow the person to be identified in the data, either directly or indirectly. By definition, anonymous data cannot be re-identified, even with additional information; this type of data is therefore not subject to current privacy regulations (e.g., the GDPR [1]).

This key difference can be better understood through a famous real-life example: the attack on the privacy of the Governor of Massachusetts. In 1996, William Floyd Weld, then Governor of Massachusetts, lost consciousness during a public event. Rushed to the nearby Deaconess Waltham Hospital, he was officially diagnosed with influenza and discharged the following day [3]. Some time later, Professor Latanya Sweeney, at the time a graduate student in computer science at MIT, reconstructed what had happened to the Governor and inferred his diagnosis by linking two different data sources: a publicly available voter roll dataset and a hospital dataset that did not contain patients’ names and was therefore considered anonymous [4].

The voter roll dataset contained the name, address, ZIP code, birth date, sex, and other attributes of every voter in the city of Cambridge (Middlesex County). The hospital dataset was issued to researchers by the Massachusetts Group Insurance Commission and contained patients’ diagnoses along with some demographic information. Since the patients’ identities were not present in the hospital data, the published information was considered harmless, and this is largely true as long as the data are considered in isolation. But Sweeney knew that the Governor had been admitted to the hospital, so she also knew that he was present in the data. She therefore obtained both datasets (in a completely legitimate way) and intersected the demographic information they shared, discovering some important facts: six individuals in the hospital dataset shared the Governor’s birth date; only three of these were men; and only one of these men lived in the Governor’s own ZIP code.

Table 3 Cambridge Voter Roll Dataset: this table represents an extract of the voter dataset.

| Surname | Name | Date of birth | Sex | Address | ZIP code | Last vote |
| --- | --- | --- | --- | --- | --- | --- |
| Weld | William Floyd | 31 July 1945 | M | 75, Essex St | 02139 | 22 May 1998 |
| Welsh | Alice | 4 July 1952 | F | 150, Main St | 02139 | 22 May 1998 |
| Weltcher | Bob | 13 July 1947 | M | 148, Gold Rd | 02138 | 22 May 1998 |

Table 4 Hospital Dataset: this table represents an extract of the medical dataset. Note that this table does not contain any direct identifiers, such as surnames or social security numbers.

| Id | Sex | Date of birth | ZIP code | Visit | Diagnosis |
| --- | --- | --- | --- | --- | --- |
| 1 | M | 31 July 1945 | 02138 | 9 May 1996 | Diabetes |
| 2 | M | 31 July 1945 | 02139 | 18 May 1996 | Stroke |
| 3 | F | 31 July 1945 | 02138 | 18 May 1996 | Osteoporosis |
| 4 | F | 31 July 1945 | 02139 | 23 May 1996 | Stroke |
| 5 | M | 4 July 1945 | 02138 | 5 June 1996 | Diabetes |
| 6 | M | 13 June 1945 | 02139 | 9 June 1996 | Arthritis |
| 7 | F | 5 April 1945 | 02139 | 4 July 1996 | Hypertension |

Table 3 and Table 4 show a simplified version of Sweeney’s attack. Looking at each table individually, no sensitive information (i.e., the diagnosis) about Governor William Floyd Weld can be inferred. However, Table 3 gives us the Governor’s date of birth, sex, and ZIP code. We can then search Table 4 for people born on 31 July 1945, finding patients 1, 2, 3, and 4. Patients 3 and 4 are women, so only patients 1 and 2 remain. Finally, we can look at the ZIP code in Table 4: patient 1 lives in a different area, so only one patient (number 2) can be the Governor. In brief, by matching the date of birth, sex, and ZIP code across the two tables, we see that Governor Weld can only be the patient who suffered a stroke. This was a clear breach of the Governor’s privacy, as the public statement about his health differed from the actual cause of hospitalization.
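The matching procedure just described can be sketched in a few lines of Python over the rows of Table 3 and Table 4. This is only an illustration of the simplified attack, not Sweeney's actual procedure; the field names, the `link` helper, and the choice of date of birth, sex, and ZIP code as the join key are assumptions of the sketch.

```python
from datetime import date

# Attributes present in both tables; used as the join key in this sketch.
QUASI_IDENTIFIERS = ("date_of_birth", "sex", "zip_code")

# Rows of Table 3 (address and last vote omitted, as they are not used).
voters = [
    {"surname": "Weld", "name": "William Floyd",
     "date_of_birth": date(1945, 7, 31), "sex": "M", "zip_code": "02139"},
    {"surname": "Welsh", "name": "Alice",
     "date_of_birth": date(1952, 7, 4), "sex": "F", "zip_code": "02139"},
    {"surname": "Weltcher", "name": "Bob",
     "date_of_birth": date(1947, 7, 13), "sex": "M", "zip_code": "02138"},
]

# Rows of Table 4 (visit date omitted, as it is not used).
patients = [
    {"id": 1, "sex": "M", "date_of_birth": date(1945, 7, 31), "zip_code": "02138", "diagnosis": "Diabetes"},
    {"id": 2, "sex": "M", "date_of_birth": date(1945, 7, 31), "zip_code": "02139", "diagnosis": "Stroke"},
    {"id": 3, "sex": "F", "date_of_birth": date(1945, 7, 31), "zip_code": "02138", "diagnosis": "Osteoporosis"},
    {"id": 4, "sex": "F", "date_of_birth": date(1945, 7, 31), "zip_code": "02139", "diagnosis": "Stroke"},
    {"id": 5, "sex": "M", "date_of_birth": date(1945, 7, 4), "zip_code": "02138", "diagnosis": "Diabetes"},
    {"id": 6, "sex": "M", "date_of_birth": date(1945, 6, 13), "zip_code": "02139", "diagnosis": "Arthritis"},
    {"id": 7, "sex": "F", "date_of_birth": date(1945, 4, 5), "zip_code": "02139", "diagnosis": "Hypertension"},
]


def link(voters, patients, keys=QUASI_IDENTIFIERS):
    """Map each voter to the patient records that share the voter's quasi-identifiers."""
    index = {}
    for patient in patients:
        index.setdefault(tuple(patient[k] for k in keys), []).append(patient)
    return {
        (voter["surname"], voter["name"]): index.get(tuple(voter[k] for k in keys), [])
        for voter in voters
    }


matches = link(voters, patients)
weld = matches[("Weld", "William Floyd")]
# weld contains exactly one record: patient 2, diagnosis "Stroke".
```

A voter is re-identified whenever the list of matching patients has exactly one element; the other two voters in the extract match no patient at all.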

Sweeney conducted similar attacks in a more structured and generalised experiment, finding that 87% of the United States population was uniquely or nearly uniquely identified by the combination of ZIP code, gender, and date of birth [2]. This led Sweeney to formulate the k-anonymity principle and to call the attributes used in the re-identification process quasi-identifiers.
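As a rough illustration of these two notions, the sketch below groups the rows of Table 4 by a chosen set of quasi-identifiers and reports the size of the smallest group, which is the largest k for which the extract would be k-anonymous with respect to those attributes, together with the fraction of records that are unique. The helper names `class_sizes`, `k_of`, and `unique_fraction` are hypothetical, and the figures concern only this seven-row extract, not Sweeney's population-level result.

```python
from collections import Counter

# Quasi-identifier columns of Table 4 (one dictionary per patient, ids 1 to 7).
patients = [
    {"sex": "M", "date_of_birth": "1945-07-31", "zip_code": "02138"},
    {"sex": "M", "date_of_birth": "1945-07-31", "zip_code": "02139"},
    {"sex": "F", "date_of_birth": "1945-07-31", "zip_code": "02138"},
    {"sex": "F", "date_of_birth": "1945-07-31", "zip_code": "02139"},
    {"sex": "M", "date_of_birth": "1945-07-04", "zip_code": "02138"},
    {"sex": "M", "date_of_birth": "1945-06-13", "zip_code": "02139"},
    {"sex": "F", "date_of_birth": "1945-04-05", "zip_code": "02139"},
]


def class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_of(records, quasi_identifiers):
    """Size of the smallest group: the data are k-anonymous for any k up to this value."""
    return min(class_sizes(records, quasi_identifiers).values())


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination occurs only once."""
    sizes = class_sizes(records, quasi_identifiers)
    unique = sum(1 for r in records if sizes[tuple(r[q] for q in quasi_identifiers)] == 1)
    return unique / len(records)


full_key = ("date_of_birth", "sex", "zip_code")
k_full = k_of(patients, full_key)                    # every row is unique, so k is 1
unique_full = unique_fraction(patients, full_key)    # 1.0 for this extract
k_sex_only = k_of(patients, ("sex",))                # 4 men and 3 women, so k is 3
```

With all three attributes every patient in the extract stands alone, which is exactly the situation the linking attack exploits; coarser keys produce larger groups at the cost of less detailed data.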

Bibliography

[1] European Parliament & Council. General Data Protection Regulation. 2016. L119, 4/5/2016, p. 1–88.

[2] Latanya Sweeney. K-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5):557–570, October 2002. URL: https://doi.org/10.1142/S0218488502001648, doi:10.1142/S0218488502001648.

[3] Daniel C. Barth-Jones. The 're-identification' of Governor William Weld's medical information: a critical re-examination of health data identification risks and privacy protections, then and now. 2012. doi:10.2139/ssrn.2076397.

[4] Latanya Sweeney. Weaving technology and policy together to maintain confidentiality. The Journal of Law, Medicine & Ethics, 25(2-3):98–110, 1997. PMID: 11066504. URL: https://doi.org/10.1111/j.1748-720X.1997.tb01885.x, doi:10.1111/j.1748-720X.1997.tb01885.x.

This entry was adapted from Comandé et al. Elgar Encyclopedia of Law and Data Science. Edward Elgar Publishing (2022) ISBN: 978 1 83910 458 9 by Francesca Pratesi, Roberto Pellungrini, and Anna Monreale.