Synonyms: Data shift.
Once trained, most machine learning systems operate on static models of the world, built from historical data that have become fixed in the systems’ parameters. Freezing the model before it is released ‘into the wild’ makes its accuracy and reliability especially vulnerable to changes in the underlying distribution of the data. When the historical data that have crystallised into the trained model’s architecture cease to reflect the population concerned, the model’s mapping function will no longer be able to accurately and reliably transform its inputs into its target output values, and the system can quickly become prone to error in unexpected and harmful ways. In all cases, the system and its operators must remain vigilant to the potentially rapid concept drifts that may occur in the complex, dynamic, and evolving environments in which an AI project intervenes. Remaining aware of these transformations in the data is crucial for safe AI [1].
More in detail
A common use case of machine learning in real-world settings is to learn a model from historical data and then deploy the model on future, unseen examples. When the data distribution of the future examples differs from that of the historical data (i.e., the joint distribution of inputs and outputs differs between the training and test or deployment stages), machine learning techniques that depend precariously on the i.i.d. assumption tend to fail. This phenomenon is called distributional shift and is a very common problem. Note that particular cases of distributional shift occur when only the input distribution changes (covariate shift) or when only the distribution of the target variable changes (prior probability shift).
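The covariate-shift case can be made concrete with a small, self-contained toy sketch (our own illustration; the model, distributions, and numbers are hypothetical): a polynomial regressor fitted on inputs drawn from one distribution degrades sharply when evaluated on inputs drawn from a shifted distribution, even though the input-output relationship itself never changes.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(x)  # the (unchanged) input-output relationship

# Historical training data: inputs drawn from N(0, 1)
x_train = rng.normal(0.0, 1.0, 500)
coeffs = np.polyfit(x_train, true_fn(x_train), deg=3)  # a simple learned model

def mse(x):
    return np.mean((np.polyval(coeffs, x) - true_fn(x)) ** 2)

# Matched test inputs vs. covariate-shifted inputs drawn from N(3, 1)
x_iid = rng.normal(0.0, 1.0, 500)
x_shift = rng.normal(3.0, 1.0, 500)
print(f"MSE in-distribution:       {mse(x_iid):.4f}")
print(f"MSE under covariate shift: {mse(x_shift):.4f}")
```

Because the cubic only approximates the target function near the training inputs, its error on the shifted inputs is orders of magnitude larger, despite the mapping from inputs to outputs being identical at training and test time.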
The problem of distributional shift is of relevance not only to academic researchers but to the machine learning community at large. Distributional shift is present in most practical applications, for reasons ranging from the bias introduced by experimental design to the irreproducibility of the testing conditions at training time. An example is email spam filtering, which may fail to recognise spam that differs in form from the spam the automatic filter was built on, with the model often remaining highly confident in its erroneous classifications. This issue is especially important in high-risk applications of machine learning, such as finance, medicine, and autonomous vehicles, where a mistake may incur financial or reputational loss, or possible loss of life. It is therefore important to assess both a model’s robustness to distributional shift and its estimates of predictive uncertainty, which enable it to detect distributional shifts [1, 5].
In general, the greater the degree of shift, the poorer the model’s performance. The performance of learned models tends to drop significantly even with a tiny amount of distribution shift between training and test [6, 7], which makes it challenging to reliably deploy machine learning in real-world applications. Although one can always broaden training coverage by adding more sources of data, applying data augmentation [9, 10], or injecting structural bias into models [11, 12, 13] so that the learned model generalises to a wider range of inputs, it is unrealistic to expect a learned model to predict accurately under any form of distribution shift, given the combinatorial nature of real-world data and tasks.
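One of the mitigations mentioned above, broadening training coverage with an additional data source, can be sketched in the same toy setting (again purely illustrative: the added uniform source and all numbers are assumptions, not a recommended recipe):

```python
import numpy as np

rng = np.random.default_rng(1)

def shifted_mse(x_train):
    """Fit a cubic on the given training inputs and evaluate it on
    inputs drawn from the shifted distribution N(3, 1)."""
    coeffs = np.polyfit(x_train, np.sin(x_train), deg=3)
    x_shift = rng.normal(3.0, 1.0, 500)
    return np.mean((np.polyval(coeffs, x_shift) - np.sin(x_shift)) ** 2)

narrow = rng.normal(0.0, 1.0, 500)   # the original historical source
extra = rng.uniform(-4.0, 6.0, 500)  # an added source widening input coverage

print(shifted_mse(narrow))                           # poor: pure extrapolation
print(shifted_mse(np.concatenate([narrow, extra])))  # better: shift region covered
```

Widening coverage helps here because the shifted inputs now fall inside the training range, but, as the entry notes, no finite amount of extra data guarantees accuracy under every possible shift.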
On the other hand, adapting a model to a specific type of distribution shift may be more approachable than adapting to every potential shift scenario, under appropriate assumptions and with appropriate modifications. By knowing where the model can predict well, one can use it to make conservative predictions or decisions, and to guide future active data collection towards shifted distributions that are not yet covered. Therefore, in addition to improving the generalisation performance of models in general, methods that explicitly deal with the presence of distribution shift are also desirable for machine learning to be used in practice.
In terms of assessment, the robustness of learned models to distributional shift is typically assessed via metrics of predictive performance on a particular task: given two (or more) evaluation sets, where one is considered matched to the training data and the other(s) shifted, models with a smaller degradation in performance on the shifted data are considered more robust. The quality of uncertainty estimates is often assessed via the ability to classify whether an example came from the “in-domain” dataset or a shifted dataset using measures of uncertainty.
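The uncertainty-based assessment described here can be sketched with a small bootstrap ensemble, whose member disagreement serves as the uncertainty score (an illustrative stand-in for the Bayesian and ensemble methods in the literature [5]; all names and numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = rng.normal(0.0, 1.0, 500)

# Small bootstrap ensemble of cubic fits; member disagreement = uncertainty
ensemble = []
for _ in range(10):
    idx = rng.integers(0, len(x_train), len(x_train))
    ensemble.append(np.polyfit(x_train[idx], np.sin(x_train[idx]), deg=3))

def uncertainty(x):
    preds = np.stack([np.polyval(c, x) for c in ensemble])
    return preds.std(axis=0)

x_iid = rng.normal(0.0, 1.0, 200)    # "in-domain" evaluation set
x_shift = rng.normal(4.0, 1.0, 200)  # shifted evaluation set

# AUROC as a rank statistic: the probability that a shifted example
# receives a higher uncertainty score than an in-domain one
s_iid, s_shift = uncertainty(x_iid), uncertainty(x_shift)
auroc = np.mean(s_shift[:, None] > s_iid[None, :])
print(f"shift-detection AUROC: {auroc:.3f}")
```

An AUROC near 1 indicates that the uncertainty measure separates shifted examples from in-domain ones well; a value near 0.5 would mean the model’s uncertainty carries no signal about the shift.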
For its part, concept shift (or concept drift) differs from distributional shift in that it concerns not the input data or the class distribution but the relationship between the input variables and the target variable. An example is customer purchasing behaviour over time in a particular online shop. This behaviour may be influenced by the strength of the economy, which is not explicitly specified in the data. In this case, the concept of interest (consumer behaviour) depends on some hidden context, not known a priori and not given explicitly in the form of predictive features, which makes the learning task more complicated. In this sense, concept shift can be categorised into three types:
sudden, abrupt, or instantaneous concept shift (e.g., following the previous example, the COVID-19 lockdowns significantly changed customer behaviour);
gradual concept shift (e.g., customers are influenced by wider economic issues, unemployment rates, or other trends), which can be divided further into moderate and slow drifts, depending on the rate of change;
cyclic concept drifts, where hidden contexts may be expected to recur, either due to cyclic phenomena, such as seasons of the year, or in association with irregular phenomena, such as inflation rates or market mood.
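The three categories can be visualised with synthetic “concepts”: a time-varying decision threshold thr(t) for labels y_t = [x_t > thr(t)], so that the input distribution stays fixed while the concept itself moves (a hypothetical construction for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)
T = np.arange(1000)  # time steps

# A time-varying decision threshold thr(t): labels are y_t = [x_t > thr(t)],
# so the input distribution is fixed while the concept itself drifts.
sudden = np.where(T < 500, 0.3, 0.7)                  # abrupt jump at t = 500
gradual = 0.3 + 0.4 * np.clip((T - 300) / 400, 0, 1)  # slow ramp from t = 300 to 700
cyclic = 0.5 + 0.2 * np.sin(2 * np.pi * T / 250)      # recurring, season-like context

x = rng.random(1000)          # inputs remain identically distributed
y = (x > sudden).astype(int)  # labels drift with the (sudden) concept
```

A model trained on the first half of such a stream would keep seeing familiar inputs while its labelling rule silently becomes wrong, which is precisely what makes concept drift harder to notice than a visible change in the inputs.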
Concept drift may be present in supervised learning problems where predictions are made and data are collected over time. These are traditionally called online or incremental learning problems, given the change expected in the data over time. For its part, the common methods for detecting concept drift in machine learning generally include ongoing monitoring of the performance (e.g., accuracy) and confidence scores of a learned model. If average performance or confidence deteriorates over time, concept drift may be occurring.
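The monitoring approach just described can be sketched as a minimal detector that flags the first point at which windowed accuracy falls a fixed tolerance below the accuracy of an initial reference window (a simplified, hypothetical variant of detectors such as DDM; the window size and tolerance are arbitrary choices):

```python
import numpy as np

def detect_drift(correct, window=50, drop=0.15):
    """Return the first time step at which accuracy over a sliding window
    falls more than `drop` below the accuracy of the initial reference
    window, or None if no such deterioration is observed."""
    correct = np.asarray(correct, dtype=float)
    baseline = correct[:window].mean()
    for t in range(window, len(correct) - window + 1):
        if correct[t:t + window].mean() < baseline - drop:
            return t
    return None

rng = np.random.default_rng(3)
# Simulated per-example correctness: ~90% accuracy dropping to ~60% at t = 300
stream = np.concatenate([rng.random(300) < 0.9, rng.random(300) < 0.6])
print(detect_drift(stream))  # flags a step shortly after the change point
```

Production systems typically add refinements such as warning levels, adaptive windows, or forgetting mechanisms, but the core signal, a sustained drop in monitored performance, is the same.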
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. 2016.
David Leslie. Understanding artificial intelligence ethics and safety. The Alan Turing Institute, 2019. URL: https://doi.org/10.5281/zenodo.3240529.
Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. MIT Press, 2008.
Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in neural information processing systems, 2006.
Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, 1050–1059. PMLR, 2016.
Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, 5389–5400. PMLR, 2019.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1–48, 2019.
Kunihiko Fukushima and Sei Miyake. Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets, pages 267–285. Springer, 1982.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 2017.
Yifan Wu. Learning to Predict and Make Decisions under Distribution Shift. PhD thesis, University of California, 2021.
Alexey Tsymbal. The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin, 106(2):58, 2004.
Kenneth O Stanley. Learning concept drift with a committee of decision trees. Technical Report UT-AI-TR-03-302, Department of Computer Sciences, University of Texas at Austin, USA, 2003.
Michael Bonnell Harries, Claude Sammut, and Kim Horn. Extracting hidden context. Machine learning, 32(2):101–126, 1998.
Gregory Ditzler and Robi Polikar. Incremental learning of concept drift from streaming imbalanced data. IEEE transactions on knowledge and data engineering, 25(10):2283–2301, 2012.
This entry was written by Jose Hernandez-Orallo, Fernando Martinez-Plumed, Santiago Escobar, and Pablo A. M. Casares.