Randomization Methods#
In Brief#
Randomization methods are used to perturb data in order to preserve the privacy of sensitive information.
More in Detail#
Randomization methods are employed to alter data in order to protect the privacy of sensitive information. They were initially used for statistical disclosure control [5] and were later extended to the problem of privacy-preserving data mining [1]. Randomization techniques modify the data using a noise component; from the perturbed data it is still feasible to extract patterns and models. In the literature, there are two types of random perturbation techniques: additive random perturbation and multiplicative random perturbation.
In additive random perturbation methods, the original dataset is represented by \(X = \{x_1, \dots, x_m\}\) and the new, distorted dataset, represented by \(Z = \{z_1, \dots, z_m\}\), is obtained by independently drawing a noise quantity \(n_i\) from a probability distribution (usually uniform or Gaussian) and adding it to each record \(x_i \in X\), i.e., \(z_i = x_i + n_i\).
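As a minimal sketch (the salary-like attribute and the noise level below are purely illustrative), the additive scheme \(z_i = x_i + n_i\) can be implemented as follows:

```python
import numpy as np

rng = np.random.default_rng(42)

# Original sensitive records: one numeric attribute per record (illustrative).
X = rng.normal(loc=50_000, scale=15_000, size=1_000)

# Additive perturbation: draw i.i.d. noise from a known distribution
# (here Gaussian with standard deviation sigma) and add it to each record.
sigma = 20_000
noise = rng.normal(loc=0.0, scale=sigma, size=X.shape)
Z = X + noise   # the distorted dataset that is released instead of X

print("original mean:", X.mean(), " perturbed mean:", Z.mean())
```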
It is important to note that both the \(m\) perturbed values in \(Z\) and the distribution of the noise are known, also to attackers. The original record values cannot be easily inferred from the distorted data, while the distribution of the dataset can be effectively recovered using one of the methods discussed in [1, 2]. Thus, individual records are not accessible, but it is possible to obtain distributions only along individual dimensions, which describe the behavior of the original dataset \(X\).
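The sketch below follows the spirit of the iterative Bayesian refinement of [1, 2]; the Gaussian noise model, the number of bins, and the number of iterations are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def reconstruct_distribution(Z, sigma, bins, n_iter=200):
    """Estimate per-bin probabilities of the original data from the perturbed
    values Z, assuming additive Gaussian noise with known standard deviation
    sigma (sketch in the spirit of [1, 2])."""
    centers = 0.5 * (bins[:-1] + bins[1:])
    p = np.full(len(centers), 1.0 / len(centers))            # uniform start
    # likelihood[i, k] = density of the noise evaluated at (z_i - a_k)
    diff = Z[:, None] - centers[None, :]
    likelihood = np.exp(-0.5 * (diff / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    for _ in range(n_iter):
        posterior = likelihood * p                           # Bayes numerator
        posterior /= posterior.sum(axis=1, keepdims=True)    # normalize rows
        p = posterior.mean(axis=0)                           # updated bin masses
    return centers, p

# Same perturbed data as in the previous sketch.
rng = np.random.default_rng(42)
X = rng.normal(loc=50_000, scale=15_000, size=1_000)
sigma = 20_000
Z = X + rng.normal(loc=0.0, scale=sigma, size=X.shape)

bins = np.linspace(Z.min(), Z.max(), 41)
centers, p_hat = reconstruct_distribution(Z, sigma, bins)   # p_hat approximates the shape of X's distribution
```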
However, traditional algorithms (especially machine learning ones) are not directly applicable, as they rely on statistics derived from individual records or from multivariate distributions. Consequently, new (machine learning) methods that work with the aggregate distributions of the data must be developed to obtain mining results.
The studies presented in [1, 3, 4] proposed techniques based on the randomization approach to perturb data and then build classification models over the randomized data. In [2], Agrawal and Aggarwal demonstrate that the choice of the reconstruction algorithm affects the accuracy with which the original probability distribution can be estimated, and they propose a method that converges to the maximum likelihood estimate of the data distribution. The authors of [3, 4] introduce methods to construct a Naive Bayesian classifier over perturbed data (a simplified illustration of this idea is sketched below). Randomization approaches have also been applied to the privacy-preserving association rule mining problem, as in the works of Evfimievski et al. and Rizvi and Haritsa.
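These methods differ in their details; the deliberately simplified sketch below only illustrates the general principle of building a classifier from aggregate statistics of additively perturbed data, and is not the algorithm of [3, 4]. It assumes zero-mean Gaussian noise with known variance, independent of the data, so per-class means remain unbiased and per-class variances can be corrected by subtracting the noise variance:

```python
import numpy as np

class NoiseAwareGaussianNB:
    """Gaussian Naive Bayes fitted on additively perturbed data (illustration only).

    With z = x + n, n ~ N(0, sigma^2) independent of x, it holds that
    E[z | c] = E[x | c] and Var(z | c) = Var(x | c) + sigma^2, so class
    means are used as-is and class variances are corrected by -sigma^2.
    """

    def __init__(self, sigma):
        self.sigma2 = sigma ** 2

    def fit(self, Z, y):
        self.classes_ = np.unique(y)
        self.priors_, self.means_, self.vars_ = [], [], []
        for c in self.classes_:
            Zc = Z[y == c]
            self.priors_.append(len(Zc) / len(Z))
            self.means_.append(Zc.mean(axis=0))
            # Subtract the known noise variance, with a small positive floor.
            self.vars_.append(np.maximum(Zc.var(axis=0) - self.sigma2, 1e-9))
        return self

    def predict(self, X):
        scores = []
        for prior, mu, var in zip(self.priors_, self.means_, self.vars_):
            log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var)
            scores.append(np.log(prior) + log_lik.sum(axis=1))
        return self.classes_[np.argmax(np.stack(scores), axis=0)]
```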
For privacy-preserving data mining and machine learning, multiplicative random perturbation techniques can also be utilized. The primary techniques of multiplicative perturbation are based on the work presented in [5].
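A common instance is perturbation by random projection, which exploits the Johnson-Lindenstrauss lemma [5]: pairwise distances are approximately preserved, so distance-based mining remains possible while individual attribute values are hidden. The sketch below uses a Gaussian projection matrix and an arbitrary target dimension, both illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection_perturb(X, k):
    """Multiplicative perturbation of an m x d data matrix X by projecting it
    onto k random dimensions (sketch; k is chosen by the data owner)."""
    d = X.shape[1]
    R = rng.normal(0.0, 1.0, size=(d, k)) / np.sqrt(k)   # random Gaussian map
    return X @ R

# Toy usage: 500 records with 20 attributes, projected down to 8 dimensions.
X = rng.normal(size=(500, 20))
Z = random_projection_perturb(X, k=8)

# Pairwise distances are roughly preserved between X and Z.
i, j = 3, 42
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Z[i] - Z[j]))
```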
Weaknesses of randomization methods#
Unfortunately, the main issue with randomization methods is that they are not secure against attacks that exploit prior knowledge. In fact, in [6], Kargupta et al. show that the original data matrix can be reconstructed from the randomized data matrix using a spectral filtering technique based on random matrix theory.
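A simplified sketch of the spectral filtering idea is given below; it assumes additive Gaussian noise with known variance and uses the Marchenko-Pastur bound to decide which eigen-components of the perturbed covariance are attributed to the data rather than to the noise (the details of [6] are omitted):

```python
import numpy as np

def spectral_filtering_attack(Z, sigma):
    """Estimate the original data matrix X from perturbed data Z = X + N.

    Simplified sketch: eigenvalues of a pure-noise covariance concentrate
    below the Marchenko-Pastur bound, so eigenvectors of Z's covariance with
    larger eigenvalues are treated as the data ("signal") subspace and Z is
    projected onto it.
    """
    m, d = Z.shape
    Zc = Z - Z.mean(axis=0)
    cov = (Zc.T @ Zc) / m
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Largest eigenvalue expected from pure noise with variance sigma^2.
    lam_max = sigma ** 2 * (1 + np.sqrt(d / m)) ** 2
    signal = eigvecs[:, eigvals > lam_max]           # "signal" eigenvectors
    return Zc @ signal @ signal.T + Z.mean(axis=0)   # estimate of X
```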
In order to overcome this problem, Differential Privacy, a more recent class of randomization algorithms, has been proposed.
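As an illustration of why Differential Privacy is itself a randomization scheme, the sketch below shows the Laplace mechanism, its best-known primitive; the query, its sensitivity, and the privacy budget epsilon are illustrative choices:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=np.random.default_rng()):
    """Release a numeric query answer with epsilon-differential privacy by
    adding Laplace noise scaled to the query's sensitivity."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query (sensitivity 1) answered with epsilon = 0.5.
private_count = laplace_mechanism(true_value=1234, sensitivity=1, epsilon=0.5)
print(private_count)
```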
Bibliography#
- 1 Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. In SIGMOD Conference, 2000.
- 2 Dakshi Agrawal and Charu C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In 20th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2001.
- 3 Justin Z. Zhan, Stan Matwin, and LiWu Chang. Privacy-preserving collaborative association rule mining. In DBSec, 2005.
- 4 Peng Zhang, Yunhai Tong, Shiwei Tang, and Dongqing Yang. Privacy preserving Naive Bayes classification. In ADMA, 2005.
- 5 William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 1984.
- 6 Hillol Kargupta, Souptik Datta, Qi Wang, and Krishnamoorthy Sivakumar. On the privacy preserving properties of random data perturbation techniques. In 3rd IEEE International Conference on Data Mining, 2003.
This entry was written by Francesca Pratesi.