# Pseudonymization

## In Brief

**Pseudonymisation** aims to substitute one or more identifiers that link the identity of an individual to their data with a surrogate value, called a **pseudonym** or **token**.

## More in detail

To preserve a subject's privacy, one of the most basic methodologies is to decouple the identity of that subject from their data. This process is called pseudonymisation. The typical practical approach to achieving pseudonymity is to detect which attributes in the data may reveal the subject's identity, called *personal identifiers*, and substitute them with some other value. However, re-identification may be needed in certain cases (for example, to contact the data subject for further questions), so the personal identifiers are often retained in order to re-associate subject and identity. This association should be secured and inaccessible to anybody having access to the pseudonymised data, so that protection is guaranteed.

Article 4(5) of the European General Data Protection Regulation (GDPR) {cite}`gdpr` describes pseudonymisation as:

> "the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person".

This definition indicates that the additional information needed to actually link a subject's identity to their data should be the focus of pseudonymisation techniques. Indeed, pseudonymisation reduces the risk that publishing data leads to a direct re-identification (see {doc}`./L2.reidentification`). Recital 28 of the GDPR states that the "explicit introduction of pseudonymisation in this Regulation is not intended to preclude any other measures of data protection".
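
The practical approach described above (substituting personal identifiers with surrogate values while keeping the re-identification information separately) can be illustrated with a minimal Python sketch. The function name `pseudonymise`, the field name `name`, and the sample records are hypothetical and only serve the illustration; a real deployment would also need secure storage and access control for the lookup table.

```python
import secrets


def pseudonymise(records, id_field):
    """Replace `id_field` in each record with a random token.

    Returns the pseudonymised records and a lookup table
    (token -> original identifier). The lookup table is the
    "additional information" that must be stored separately.
    """
    token_of = {}  # original identifier -> token (same subject, same token)
    lookup = {}    # token -> original identifier (keep separate and secured)
    output = []
    for record in records:
        original = record[id_field]
        if original not in token_of:
            token = secrets.token_hex(16)  # 128-bit random surrogate value
            token_of[original] = token
            lookup[token] = original
        # Copy the record so the input data is left untouched.
        output.append({**record, id_field: token_of[original]})
    return output, lookup


# Hypothetical sample data.
records = [
    {"name": "Alice Rossi", "diagnosis": "A"},
    {"name": "Bruno Bianchi", "diagnosis": "B"},
    {"name": "Alice Rossi", "diagnosis": "C"},
]
pseudonymised, lookup = pseudonymise(records, "name")
```

In this sketch the tokens are random, so they reveal nothing about the original identifiers; re-identification is possible only for whoever holds `lookup`, which is why it should never be distributed together with `pseudonymised`.
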
Article 6(4) of the GDPR also mentions pseudonymisation among the possible "appropriate safeguards", and data controllers are encouraged to proceed by "pseudonymising personal data as soon as possible" (Recital 78 GDPR) and to implement "appropriate technical and organisational measures, such as pseudonymisation" (Article 25(1) GDPR), both at the time of the determination of the means for processing and at the time of the processing itself.

The pseudonym must be distinguishable and irreversible in the absence of additional information. This means that it should not be possible to reconstruct the original value by considering the pseudonym alone, i.e., without the additional information it should be infeasible to compute the original value from the pseudonym. The correspondence between original value and pseudonym must be stored in a separate location and must be secured against data breaches. Surrogate values also need to be managed after their generation, either internally or externally. In the latter case, the institution that owns the data outsources this service to a qualified (and trusted) third party.

### Pseudonymisation techniques

There are several techniques that perform pseudonymisation. They can generally be grouped into the following main categories:

1. **Cryptographic with secret key**: these techniques use a mathematical mechanism to alter the original value through the application of a secret *key*. This key is at the core of the mechanism: with it, the pseudonymisation can be reversed, so the key has to be secured at all times.

2. **Hash-based**: these techniques use a function that, given an identifier (composed of one or more attributes) of arbitrary length, returns a value of fixed size (e.g., 256 bits, which corresponds to 32 bytes or 64 hexadecimal characters), called the hash value or message digest. The hash function is deterministic and must be irreversible (one-way), i.e., given an output of the function it is infeasible to compute an input that produces it.
   Functions typically used for hashing are *SHA-2* {cite}`SHA2` and *SHA-3* {cite}`SHA3`, for example *SHA3-512*, whose output values are 512 bits long.

3. **Keyed-hash based**: a combination of the previous techniques, where the hash function requires a secret key, here called *salt* {cite}`Hmac`, to compute its output. This is generally considered a more robust approach than simple hashing. By varying the key, the same data subject's identifier can be translated into several different pseudonyms. In the cryptography literature, these functions are referred to as *message authentication codes* {cite}`cripto`. This family of techniques is more robust against some brute-force attacks, especially if the salt is changed sufficiently often.

4. **Keyed-hash function with deletion**: the same as the previous one, but after the generation of the pseudonym the correspondence table is deleted, so pseudonyms can no longer be re-associated with personal identifiers.

5. **Tokenization**: the idea of tokenization is to substitute the subjects' identifiers with a token, which may be generated with some cryptographic method. However, tokenization itself is a non-mathematical approach: the data is replaced, but its type and length are not altered. Typically, knowledge of a token is of no use to a third party. Another difference is that tokenization is fast and requires few computational resources {cite}`tokens`.

## Bibliography

```{bibliography} ./references.bib
:style: unsrt
:filter: docname in docnames
```

> This entry was readapted from *Comandé et al. Elgar Encyclopedia of Law and Data Science. Edward Elgar Publishing (2022) ISBN: 978 1 83910 458 9* by Francesca Pratesi, Roberto Pellungrini, and Anna Monreale.
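
As a complement to the hash-based and keyed-hash categories listed under "Pseudonymisation techniques", the following minimal Python sketch contrasts the two using only the standard library. The function names, the sample identifier, and the choice of SHA-256 and HMAC are assumptions made for illustration, not a prescribed implementation.

```python
import hashlib
import hmac
import secrets


def hash_pseudonym(identifier: str) -> str:
    """Hash-based pseudonym: unkeyed SHA-256 digest, as 64 hex characters."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def keyed_hash_pseudonym(identifier: str, key: bytes) -> str:
    """Keyed-hash pseudonym: HMAC-SHA-256 computed with a secret key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical identifier and two independently generated secret keys.
identifier = "alice.rossi@example.org"
key_a = secrets.token_bytes(32)
key_b = secrets.token_bytes(32)

plain = hash_pseudonym(identifier)
keyed_a = keyed_hash_pseudonym(identifier, key_a)
keyed_b = keyed_hash_pseudonym(identifier, key_b)
```

Because the unkeyed hash is deterministic and public, anyone who can guess candidate identifiers can recompute `plain` and test for a match. The keyed variants cannot be recomputed without the key, and changing the key (`key_a` versus `key_b`) yields a different pseudonym for the same identifier.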