Continuous Performance Monitoring

In brief

Continuous performance monitoring is the activity of tracking, logging and monitoring, over time, the behaviour and performance of Artificial Intelligence and Machine Learning models. It is particularly relevant after deployment to production, in order to detect performance drift and model outages.

More in Detail

Monitoring the live functioning of a productionised ML/AI model or system is an emerging topic that is gaining attention as more and more methods are deployed in the industrial, commercial and public sectors. Like any other piece of software, a tool based on AI/ML needs to be maintained over time to fix bugs and ensure quality. ML models and systems, however, require specific strategies that take into account the fact that they learn from data.

Ideally, the behaviour of ML models trained on well-curated sample data is expected to generalise to new, unseen data in the post-deployment phase. This rarely happens in practice, and a model’s performance assessed live is often different from the performance evaluated offline during development. Furthermore, it is well known that the performance of an AI model or system tends to degrade over time.

Several phenomena have been identified as drivers of this decay. The input data fed into the ML model may contain unexpected patterns not present in the training datasets. Moreover, the characteristics of the data may change over time, so that the relationships learned by the ML model no longer hold.

This phenomenon, termed concept or model drift [1], can lead the model to make wrong predictions. Additionally, if the nature (or distribution) of the input data becomes vastly different from that of the data used for training, the performance can even drop below acceptable levels. This phenomenon is known as covariate shift [2]. Performance degradation can also result from the impact that the deployed ML model itself has on the decision process it supports. The ML model may influence other elements involved in the decision, or induce an overall change in the phenomenon being modelled, that was not taken into account during training.
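As an illustration of how covariate shift on a single numeric feature might be checked, the sketch below compares a training-time reference sample with a window of live inputs using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distribution functions). The function names and the `threshold` value of 0.3 are assumptions made for this example, not part of the cited works; in practice the threshold would need to be calibrated for the application.

```python
from bisect import bisect_right


def ks_statistic(reference, live):
    """Two-sample Kolmogorov-Smirnov statistic for one numeric feature.

    Returns the largest absolute gap between the empirical CDFs of the
    reference (training-time) sample and the live sample, in [0, 1].
    """
    ref = sorted(reference)
    cur = sorted(live)
    gap = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def covariate_shift_suspected(reference, live, threshold=0.3):
    """Flag a possible shift; `threshold` is an illustrative assumption."""
    return ks_statistic(reference, live) > threshold


training_feature = [i / 100 for i in range(100)]
similar_live = [i / 100 for i in range(100)]
shifted_live = [10 + i / 100 for i in range(100)]

print(covariate_shift_suspected(training_feature, similar_live))
print(covariate_shift_suspected(training_feature, shifted_live))
```

A check of this kind only looks at one input feature at a time and says nothing about concept drift, where the inputs may look unchanged while their relationship to the target has moved.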

Overall, after its deployment, an ML method can come across so many difficulties and changes that the level of effort and skill needed for its maintenance could be an order of magnitude higher than that needed for model building.

Given these concerns, several strategies and best practices have been investigated to monitor the behaviour of ML methods after deployment, also in relation to the consequences these methods can have. The first work, published in 2015, described the various challenges that ML methods raise after deployment in relation to data dependencies, model complexity, reproducibility, testing, and changes in the external world [3]. Since then, several methods have been presented in the literature, focusing specifically on data [4], on the role of humans in ML deployment [5], on testing strategies [6], or on the definition of a general framework to track ML methods in their live functioning (e.g., pipelines, datasets, execution configurations, code and human actions) [7].

Overall, the best practices, also promoted by industrial actors [8, 9], include continuous monitoring of the ML system to assess its quality and “vitality”. Various types of metrics are suggested in this respect, focusing mainly on performance evaluation. The idea is to detect changes in behaviour and then act, via re-training or an active learning approach (when reinforcement learning is adopted), to rectify any incorrect behaviour. It should be noted that model maintenance can be seen as nurturing the model: the model can take advantage of the new knowledge coming from the real-world setting and can thus improve on the originally released version.
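The detect-then-act idea described above can be sketched as a rolling-window monitor that compares live accuracy with the accuracy measured offline and signals when re-training may be warranted. The class name `PerformanceMonitor`, the `tolerance` of 0.05 and the window size are hypothetical choices for this example, and the sketch assumes that ground-truth labels eventually become available for live predictions.

```python
from collections import deque


class PerformanceMonitor:
    """Tracks accuracy over the most recent labelled predictions."""

    def __init__(self, baseline_accuracy, tolerance=0.05, window_size=100):
        self.baseline_accuracy = baseline_accuracy  # measured offline
        self.tolerance = tolerance                  # accepted drop (assumed)
        self.outcomes = deque(maxlen=window_size)   # True where prediction was correct

    def record(self, prediction, ground_truth):
        """Log one live prediction once its true label is known."""
        self.outcomes.append(prediction == ground_truth)

    def live_accuracy(self):
        """Accuracy over the current window, or None if nothing was logged."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """True when a full window falls below baseline minus tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return self.live_accuracy() < self.baseline_accuracy - self.tolerance


monitor = PerformanceMonitor(baseline_accuracy=0.9, window_size=10)
for _ in range(10):
    monitor.record(prediction=1, ground_truth=1)
print(monitor.needs_retraining())

for _ in range(5):
    monitor.record(prediction=1, ground_truth=0)
print(monitor.live_accuracy(), monitor.needs_retraining())
```

Accuracy is used here only for simplicity; the same structure would apply to any metric suited to the task, and the signal would normally feed a human review or a re-training pipeline rather than trigger changes on its own.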

Monitoring and maintenance can be performed in a proactive or reactive fashion. Proactive monitoring aims to identify the input samples that deviate significantly from the patterns seen in the training phase and to analyse them in more detail to understand any drift. The reactive approach entails detecting a wrong output and identifying its causes, so as to understand how the method can be rectified.
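A minimal sketch of the proactive approach, assuming purely numeric features: per-feature means and standard deviations are computed on the training data, and each incoming sample is checked for features lying far from those statistics. The function names and the `z_threshold` of 3.0 are assumptions for illustration; a real system would likely use a richer notion of deviation.

```python
from statistics import mean, stdev


def fit_reference(training_rows):
    """Compute (mean, standard deviation) for each feature column."""
    columns = list(zip(*training_rows))
    return [(mean(col), stdev(col)) for col in columns]


def deviating_features(sample, reference_stats, z_threshold=3.0):
    """Return indices of features far from what was seen in training."""
    flagged = []
    for index, (value, (mu, sigma)) in enumerate(zip(sample, reference_stats)):
        if sigma == 0:
            # Constant feature in training: any other value is a deviation.
            if value != mu:
                flagged.append(index)
        elif abs(value - mu) / sigma > z_threshold:
            flagged.append(index)
    return flagged


training_rows = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0), (2.0, 9.0)]
stats = fit_reference(training_rows)

print(deviating_features((2.0, 10.0), stats))
print(deviating_features((2.5, 50.0), stats))
```

Samples flagged in this way can be set aside for closer analysis before a wrong output is ever observed, which is what distinguishes this from the reactive approach.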

The Continuous Delivery [10] and DevOps [11] approaches have also been proposed to better manage the risks of releasing changes to Machine Learning applications and thus to carry out such releases safely and reliably.

Bibliography

1: Alexey Tsymbal. The problem of concept drift: definitions and related work. 2004.
2: Masashi Sugiyama and Motoaki Kawanabe. Machine Learning in Non-Stationary Environments. MIT Press, 2012. ISBN 9780262017091.
3: D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL: https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf.
4: Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. Data management challenges in production machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17, 1723–1726. New York, NY, USA, 2017. Association for Computing Machinery. URL: https://doi.org/10.1145/3035918.3054782, doi:10.1145/3035918.3054782.
5: Ilias Flaounas. Beyond the technical challenges for deploying machine learning solutions in a software company. 2017. URL: https://arxiv.org/abs/1708.02363, doi:10.48550/ARXIV.1708.02363.
6: Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, and D. Sculley. What’s your ML test score? A rubric for ML production systems. In NIPS 2016 Workshop. 2016.
7: Vinay Sridhar, Sriram Subramanian, Dulcardo Arteaga, Swaminathan Sundararaman, Drew Roselli, and Nisha Talagala. Model governance: reducing the anarchy of production ML. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), 351–358. Boston, MA, jul 2018. USENIX Association. URL: https://www.usenix.org/conference/atc18/presentation/sridhar.
8: Accenture. Model behavior. Nothing artificial. 2017.
9: SAS. Machine learning model governance. White paper. 2019.
10: Wolff. A Practical Guide to Continuous Delivery. Addison-Wesley, 2017.
11: Loukides. What is DevOps? O'Reilly Media, 2012.

This entry was written by Sara Colantonio.

The TAILOR Handbook of Trustworthy AI
