# Pseudonymization
## In Brief
**Pseudonymisation** aims to substitute one or more identifiers that link(s) the identity of an individual to its data with a surrogate value, called **pseudonym** or **token**.
## More in detail
To preserve a subject's privacy, one of the most basic methodology is to
de-couple the identity of said subject from its data. This is process is
called pseudonymisation. The typical practical approach to achieve
pseudonymity is to detect which attributes in the data may reveal the
subject's identity, called *personal identifiers*, and substitute them
with some other value.
However, re-identification may be needed in certain cases (for example, to contact data subject for further questions), therefore
personal identifiers are often maintained for re-associating subject and
identity. This association should be secured and inaccessible to anybody
having access tho the pseudonymised data, so that protection is
guaranteed.
Following the description in the Article 4(5) of the
European General Data Protection Regulation (GDPR) {cite}`gdpr`:
> "the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person".
This definition indicates that the additional information needed to
actually link a subject's identity to it's data should be the focus of
pseudonymisation techniques. Indeed, pseudonymization reduces the risk of publishing data due to a direct re-identification (See {doc}`./L2.reidentification`).
Recital 28 of the GDPR, states that "explicit introduction of pseudonymisation in this Regulation is not intended to preclude any other measures of data protection". Article 6(4) of the GDPR also reports that pseudonymisation could be an "appropriate safeguards" and that data controller should operate "pseudonymising personal data as soon as possible" (Recital 78 GDPR) and implementing "appropriate technical and organisational measures, such as pseudonymisation" (Article 25(1) GDPR) both at the time of the determination of the means for processing and at the time of the processing itself.
The pseudonym must be distinguishable and irreversible in the absence of additional information. This means that it should not be possible to reconstruct the original value by just considering the pseudonym, i.e., there does not exist any function that computes the original value with the pseudonym as input. The correspondence between original value and pseudonym must be stored in a separate location and must be secured against data breaches. Surrogate values need also to be managed after the generation, either internally or externally. In the latter case, the institution who owns data outsources this service to a qualified (and trusted) third party.
### Pseudonymisation techniques
There are several techniques that perform pseudonymisation. They can be
generally summarized in three main categories:
1. **Cryptographic with secret key**: these techniques use mathematical
mechanism to alter the original value through the application of a
secret *key*. This key is at the core of the mechanism: with it, the
pseudonymisation can be reversed, so the key has to be secured at
all times.
2. **Hash-based**: these techniques use a function that, given an
identifier (composed by one or more attributes) with arbitrary
length returns a value of fixed size (e.g., size 256 bits, which correspond to
32 characters), being called hash value or message digest. The hash
function is usually a deterministic function and must be
irreversible, i.e., for any input of the function it is infeasible
to compute the inverse function from the output. Functions typically
used for hashing are *SHA-2* {cite}`SHA2` and *SHA-3* {cite}`SHA3`, for example
the *SHA3-512* which has output values of length 512 bits.
3. **Keyed-hash based**: a combination of the previous techniques where
the hash function requires a key, called *salt* {cite}`Hmac`, to compute
its output. This is generally considered a more robust approach that
simple hashing. Varying the key, the same data subject's identifier
can be translated in several different pseudonyms. In cryptography
literature, these are referred to as *message authentication codes*
{cite}`cripto`. This family of techniques is more robust against some
brute-force attacks, especially if the salt is changed sufficiently
often.
4. **Keyed-hash function with deletion**: equal to the previous one,
but after the generation of the pseudonym, the correspondence table
is deleted, i.e., we cannot associate again pseudonyms to personal
identifiers.
5. **Tokenization**: the idea of tokenization is to substitute the
subjects' identifiers with a token generated with some cryptographic
methods. However, tokenization is a non-mathematical approach: data
is replaced but the type or length is not altered. Typically
knowledge of a token has no usefulness for a third party. Another
difference is that tokenization is fast and can be done with few
computational resources.{cite}`tokens`
## Bibliography
```{bibliography} ./references.bib
:style: unsrt
:filter: docname in docnames
```
> This entry was readapted from *Comandé et al. Elgar Encyclopedia of Law and Data Science. Edward Elgar Publishing (2022) ISBN: 978 1 83910 458 9* by Francesca Pratesi, Roberto Pellungrini, and Anna Monreale.