Monitoring machine learning models in production



I am looking for tools that allow me to monitor machine learning models once they have gone to production. I would like to monitor:

  1. Long term changes: changes of distribution in the features with respect to training time, that would suggest retraining the model.
  2. Short term changes: bugs in the features (radical changes of distribution).
  3. Changes in the performance of the model with respect to a given metric.

I have been looking around the Internet, but I don't see any in-depth analysis of any of these cases. Can you point me to techniques, books, references, or software?

David Masip

Posted 2019-12-13T12:15:59.717

Reputation: 5 101



The changes in distribution with respect to training time are sometimes referred to as concept drift.

It seems to me that the amount of information available online about concept drift is not very large. You may start with its Wikipedia page or some blog posts, like this and this.

In terms of research, you may want to take a look at the scientific production of João Gama, or at chapter 3 of his book.

Regarding software packages, a quick search reveals a couple of python libraries on github, like tornado and concept-drift.

Update: recently I came across ML-drifter, a python library that seems to match the requirements of the question for scikit-learn models.
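Libraries like tornado implement several drift detectors from Gama's line of work. As a minimal sketch of the underlying idea (not any library's actual API), here is a bare-bones version of the Drift Detection Method (DDM) from Gama et al. (2004), which watches a stream of 0/1 prediction errors and signals when the error rate rises well above its historical minimum:

```python
import math

class DDM:
    """Minimal sketch of the Drift Detection Method (Gama et al., 2004).

    Feed it a stream of 0/1 prediction errors; it tracks the running
    error rate p and its std s, and signals drift when p + s rises
    well above the best (minimum) p_min + s_min seen so far.
    """

    def __init__(self, warn_level=2.0, drift_level=3.0, min_samples=30):
        self.warn_level = warn_level
        self.drift_level = drift_level
        self.min_samples = min_samples
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0
        self.s = 0.0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the model misclassified this sample, else 0.
        Returns 'drift', 'warning', or 'stable'."""
        self.n += 1
        # incremental estimate of the error rate and its std
        self.p += (error - self.p) / self.n
        self.s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + self.s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s > self.p_min + self.drift_level * self.s_min:
            self.reset()
            return "drift"
        if self.p + self.s > self.p_min + self.warn_level * self.s_min:
            return "warning"
        return "stable"

# a stream that is accurate at first, then degrades sharply
detector = DDM()
stream = [0] * 200 + [1] * 50  # errors jump from 0% to 100%
states = [detector.update(e) for e in stream]
```

Running this, the detector stays `"stable"` over the accurate stretch and flags `"drift"` as soon as the error burst begins.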



Reputation: 10 494

+1, "concept drift detection" is probably a phrase to be searching for. As far as tools, I get the feeling most companies will have built in-house solutions so far; but now the DS-as-a-service companies are starting in, see AWS SageMaker Model Monitor. – Ben Reiniger – 2019-12-23T14:57:02.777

I finally got around to looking at the 2014 survey by Gama; it's really great. – Ben Reiniger – 2020-01-03T17:03:58.673


While reading the Nature paper Explainable AI for Trees: From Local Explanations to Global Understanding, I found that Section 2.7.4, "Local model monitoring reveals previously invisible problems with deployed machine learning models", says the following:

Deploying machine learning models in practice is challenging because of the potential for input features to change after deployment. It is hard to detect when such changes occur, so many bugs in machine learning pipelines go undetected, even in core software at top tech companies [78]. We demonstrate that local model monitoring helps debug model deployments by decomposing the loss among the model’s input features and so identifying problematic features (if any) directly. This is a significant improvement over simply speculating about the cause of global model performance fluctuations

Then they run three experiments with the Shapley values provided by TreeExplainer:

  1. We intentionally swapped the labels of operating rooms 6 and 13 two-thirds of the way through the dataset to mimic a typical feature pipeline bug. The overall loss of the model’s predictions gives no indication that a problem has occurred (Figure 5A), whereas the SHAP monitoring plot for room 6 feature clearly shows when the labeling error begins

  2. Figure 5C shows a spike in error for the general anesthesia feature shortly after the deployment window begins. This spike corresponds to a subset of procedures affected by a previously undiscovered temporary electronic medical record configuration problem (Methods 17).

  3. Figure 5D shows an example of feature drift over time, not of a processing error. During the training period and early in deployment, using the ‘atrial fibrillation’ feature lowers the loss; however, the feature becomes gradually less useful over time and ends up hurting the model. We found this drift was caused by significant changes in atrial fibrillation ablation procedure duration, driven by technology and staffing changes

Current deployment practice is to monitor the overall loss of a model over time, and potentially statistics of input features. TreeExplainer enables us to instead directly allocate a model’s loss among individual features
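TreeExplainer's loss-based SHAP values require the shap package. As a rough, package-light stand-in for the same idea (attributing the model's loss to individual input features, per time window), here is a permutation-based sketch on synthetic data; the sign-flip "pipeline bug" is my own invention for illustration, not the paper's experiment:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# train on clean data: label depends on the sum of two features
X_train = rng.normal(size=(1000, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = GradientBoostingClassifier().fit(X_train, y_train)

# "deployment" data: feature 1 gets sign-flipped halfway through,
# simulating a feature-pipeline bug
X_dep = rng.normal(size=(1000, 2))
y_dep = (X_dep[:, 0] + X_dep[:, 1] > 0).astype(int)
X_dep[500:, 1] *= -1

def feature_loss_increase(model, X, y, j, rng):
    """How much does the log loss rise when feature j is permuted?
    Positive: the feature helps. Near zero or negative in a window:
    something is wrong with that feature in that window."""
    base = log_loss(y, model.predict_proba(X)[:, 1], labels=[0, 1])
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return log_loss(y, model.predict_proba(Xp)[:, 1], labels=[0, 1]) - base

# compare the healthy and buggy windows for feature 1
healthy = feature_loss_increase(model, X_dep[:500], y_dep[:500], 1, rng)
buggy = feature_loss_increase(model, X_dep[500:], y_dep[500:], 1, rng)
```

In the healthy window, permuting feature 1 hurts (positive loss increase); in the buggy window, the corrupted feature is actively misleading, so its contribution collapses. Tracking this per-feature quantity over rolling windows is the crude analogue of the paper's SHAP monitoring plots.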

Carlos Mougan


Reputation: 4 420


What you're describing is known as concept drift and there are quite a few software startups bringing a solution to market (us included - happy to show you what we have).

  1. A very simplistic way of detecting drift is monitoring the differences between distributions of the predicted dataset and the training dataset using a Kolmogorov-Smirnov test or Wasserstein distance.

  2. For radical changes in distribution, you might create a model that learns the dataset's unique patterns, and use an outlier detector to flag genuinely radical changes to the distribution while avoiding false positives.

  3. This is an interesting use case - are you able to share an example?
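The first point can be sketched in a few lines with scipy. The thresholds below (p < 0.01, the synthetic mean shift of 0.3) are illustrative choices, not recommendations:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(42)

# feature values seen at training time vs. in production,
# with a simulated mean shift of 0.3 in the live data
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.3, scale=1.0, size=5000)

# two-sample Kolmogorov-Smirnov test: small p-value means the two
# samples are unlikely to come from the same distribution
stat, p_value = ks_2samp(train_feature, live_feature)
drifted = p_value < 0.01

# Wasserstein distance: a magnitude of "how far" the distributions
# moved, useful as a drift score to track over time
w = wasserstein_distance(train_feature, live_feature)
```

In practice you would run this per feature on a rolling window of production data and alert when either statistic crosses a threshold calibrated on known-good periods.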



Reputation: 21


If I understand your query correctly, you are looking for MLflow, where you can track your experiments and visualize them using its APIs.



Reputation: 339


You can have a look at Anodot's MLWatcher. A few of the highlights of this tool are as follows.

  • MLWatcher collects and monitors metrics from machine learning models in production.
  • This open-source Python agent is free to use; simply connect it to your BI service to visualize the results.
  • To detect anomalies in these metrics, either set rule-based alerting or sync to an ML anomaly-detection solution, such as Anodot, to execute at scale.
  • It monitors the distribution of input features.

You can have a look at their complete features here.



Reputation: 101


There are all kinds of solutions right now. Mainly, you can divide them into two categories:

  1. Monitoring features as part of a bigger AI platform
  2. A dedicated monitoring solution

A few factors to examine before choosing between the two options:

  • What is the scale of your ML model usage?
  • What is the impact of your models? Are they part of your core business, or only an enrichment / niche part of it?
  • What is your DS team size?
  • How many platforms do you use to deploy models to production? Do you have only one standard way to deploy?

The general theme: the bigger the ML operation, and the more you need it to be agnostic to the deployment platform, the more you should go for a dedicated solution. If your ML operations are still very limited and your serving platform already has a few monitoring features in place, it might be good enough for you for now.

When examining a specific solution, consider the following points:

Integration - How complicated is it?

Measurement - Does it offer stability measurement for your data (input / inference / label)?

Performance analysis - Does it provide the ability to close the loop and see performance analytics? (By the way, in most cases, even if you can get performance metrics, you probably won't be able to base your monitoring on them, because in reality such performance information is usually available only with a delay after the predictions were made.)

Resolution - Can the system detect and measure such metrics at a higher resolution (sub-segments of your entire datasets)? In many cases, drift or technical issues will occur only in a specific subset of your data.

Alerts - Does the solution also include a statistical alert mechanism? Ultimately, it's hard to track all the KPIs mentioned above, and every dataset behaves differently, so thresholds are hard to define.

Dashboard - Does the solution contain a clear UI dashboard?

API - Can you consume such production insights directly from an API? It can be very beneficial to build automation on top of it.
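On the alerting point above: because fixed thresholds are hard to define, a common workaround is to learn each metric's normal range from its own recent history. A minimal stdlib-only sketch (the window size and z-score threshold are arbitrary illustrative choices):

```python
from collections import deque
import statistics

def make_alerter(window=100, z_threshold=4.0, min_history=30):
    """Alert when a metric deviates strongly from its own recent
    history, instead of relying on a hand-picked fixed threshold."""
    history = deque(maxlen=window)

    def check(value):
        alert = False
        if len(history) >= min_history:
            mean = statistics.fmean(history)
            std = statistics.pstdev(history) or 1e-9
            alert = abs(value - mean) / std > z_threshold
        history.append(value)
        return alert

    return check

check = make_alerter()
# a metric (say, daily AUC) that hovers around 0.80, then breaks
values = [0.80, 0.81, 0.79, 0.80] * 20 + [0.20]
alerts = [check(v) for v in values]
```

The normal jitter between 0.79 and 0.81 never fires; the collapse to 0.20 does.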

By the way, here is a blog post I wrote about the different elements that should be considered when monitoring ML, with a review of current solutions.

Oren Razon


Reputation: 46


Something like this:

All credit goes to Abhishek. In any case, as long as you save your desired differences, metrics, and changes locally in your source code, you can hook it up with Slack and receive messages. So all three of your cases are totally doable; you just have to get your hands dirty a bit (it looks like a great hobby project!).

[screenshot: Slack messages reporting model metrics]
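For example, posting to a Slack incoming webhook needs nothing beyond the standard library. The webhook URL below is a placeholder; create your own incoming webhook in your Slack workspace and substitute its URL:

```python
import json
import urllib.request

# Placeholder -- replace with your own Slack incoming-webhook URL.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def build_alert(metric_name, old_value, new_value):
    """Format a drift/metric-change alert as a Slack message payload."""
    return {
        "text": (
            f":warning: `{metric_name}` changed from "
            f"{old_value:.3f} to {new_value:.3f}"
        )
    }

def send_to_slack(payload, url=WEBHOOK_URL):
    """POST the JSON payload to the webhook; returns the HTTP status."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

payload = build_alert("ks_statistic_feature_age", 0.02, 0.31)
# send_to_slack(payload)  # uncomment once WEBHOOK_URL is real
```

The metric name `ks_statistic_feature_age` is just an example; wire in whatever differences you computed.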

Noah Weber


Reputation: 4 932