Data Drift in Machine Learning - what is it and why it's important to monitor
Mar 01, 2025
Estimated reading time - 6 minutes
What is Data Drift?
Once a model is trained and deployed, it starts seeing real-world production data. In an ideal scenario, this data has a distribution similar to that of the data the model was trained on. In most cases, however, the data changes over time.
What exactly does this change mean? In simple terms, it is a change in the statistical properties of an input parameter's distribution, such as its mean, median or standard deviation.
Such changes in statistical properties are called Data Drift. A simple data drift illustration is shown below. Mathematically, data drift is often described as a change over time in P(X), the probability distribution of the input data.
It's important to emphasize that in most cases Data Drift refers to a change in the statistical properties of the INPUT DATA (X), i.e. drift of the MODEL FEATURES. This article uses that definition as well.
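To make the definition concrete, here is a minimal sketch in Python (standard library only) that compares the mean, median and standard deviation of a reference sample with those of a production sample. The function names `summary_stats` and `summary_shift` and the sample values are made up for illustration; they do not come from any monitoring library.

```python
import statistics

def summary_stats(values):
    """Return the mean, median and standard deviation of a sample."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),
    }

def summary_shift(reference, production):
    """Difference (production minus reference) for each summary statistic."""
    ref = summary_stats(reference)
    prod = summary_stats(production)
    return {name: prod[name] - ref[name] for name in ref}

# Hypothetical values of one input feature:
# a sample seen at training time vs. a sample seen in production
reference = [6.8, 7.0, 7.1, 6.9, 7.2, 7.0]
production = [7.6, 7.8, 7.9, 7.7, 8.0, 7.8]

shift = summary_shift(reference, production)
print(shift)
```

In this toy example the mean and median move by about 0.8 while the standard deviation stays practically the same, so the feature has shifted without changing its spread. Real drift can affect any combination of these properties.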
Real world data drift example
Imagine we have a large scale coffee roasting process plant shown below.
To control and optimize this plant, we build a Machine Learning model that predicts the outlet humidity from two measured inputs: the inlet bean size and the applied roasting temperature. The model is then used to monitor the outlet humidity and keep it within the 10-15% range by adjusting the applied temperature. The ML model setup is shown below.
What is data drift in this case?
It is a possible shift in the inlet bean size.
Why can this happen?
The coffee bean supplier might change. Even though the type of coffee stays the same, the beans might be larger because they are grown in a different region (e.g. Colombia instead of Brazil).
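One possible way to check for such a shift is to compare the bean size distribution from the old supplier with the one from the new supplier. The sketch below uses the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distributions), implemented with the standard library only. The choice of this statistic, the bean sizes in millimetres and the threshold `DRIFT_THRESHOLD` are all assumptions made for illustration.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the two empirical CDFs:
    0.0 means identical samples, 1.0 means no overlap at all.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    max_gap = 0.0
    # The largest gap is always reached at one of the observed values
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Hypothetical inlet bean sizes in mm (not real measurements)
old_supplier = [6.8, 7.0, 7.1, 6.9, 7.2, 7.0, 6.7, 7.3]
new_supplier = [7.6, 7.8, 7.9, 7.7, 8.0, 7.8, 7.5, 8.1]

DRIFT_THRESHOLD = 0.5  # illustrative cut-off, to be tuned per feature

ks = ks_statistic(old_supplier, new_supplier)
drift_detected = ks > DRIFT_THRESHOLD
print(f"KS statistic: {ks:.2f}, drift detected: {drift_detected}")
```

In practice the cut-off would usually be derived from a significance level and the sample sizes rather than fixed by hand.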
Why is it important to monitor data drift?
Scenario 1: Data drift monitoring helps to estimate the model performance when true target values are rarely available
Imagine the situation shown below. Because the target is not continuously available to us, we can't continuously monitor the model error.
In this particular case, once we have collected 4 target samples, we see that the mean outlet humidity has increased and that the model error has increased slightly in that local window.
However, we often judge model accuracy over a larger time window, e.g. 10 days. So we might not re-train the model yet and instead wait for more samples.
What if we have a data monitoring system?
If we have a data monitoring system, we can detect this problem early as shown below.
This means that once data drift is detected, we can:
- Decide if we can still use this model for decision making
- In case we can't use the model, decide how we can adjust it
- Work out what to do to avoid wasting the final product
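A data monitoring system of this kind can be sketched as a small rolling-window check on an input feature, which needs no target values at all. The class below is a hypothetical example, not a reference implementation: the name `MeanDriftMonitor`, the window size, the z-score threshold and the bean size values are all assumptions, and the check only looks at the mean of the feature.

```python
import statistics
from collections import deque

class MeanDriftMonitor:
    """Flags drift when the mean of the most recent values moves too far
    from the mean of the reference (training) data."""

    def __init__(self, reference, window_size=5, z_threshold=3.0):
        self.ref_mean = statistics.mean(reference)
        self.ref_std = statistics.stdev(reference)
        self.window = deque(maxlen=window_size)
        self.z_threshold = z_threshold

    def update(self, value):
        """Add one new measurement; return True if the window looks drifted."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data in the window yet
        window_mean = statistics.mean(self.window)
        # Standard error of a window mean if the data still followed
        # the reference distribution
        std_error = self.ref_std / len(self.window) ** 0.5
        z_score = abs(window_mean - self.ref_mean) / std_error
        return z_score > self.z_threshold

# Hypothetical inlet bean sizes in mm: training data, then a production stream
reference = [6.8, 7.0, 7.1, 6.9, 7.2, 7.0, 6.7, 7.3]
stream = [7.0, 6.9, 7.1, 7.0, 7.0, 7.7, 7.8, 7.9, 7.8]

monitor = MeanDriftMonitor(reference)
alerts = [monitor.update(size) for size in stream]
print(alerts)
```

With these made-up numbers the monitor stays quiet while the beans look like the training data and raises an alert shortly after the larger beans start arriving, well before a new batch of lab samples would be available.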
Scenario 2: Understanding the root cause of bad model performance
Another case is when we observe a drop in model performance and want to understand the reason behind it.
In this particular case, there might be several reasons for the drop in model performance:
- Change in statistical properties of the model input
- Change in the target measurement (e.g. the humidity sensor is broken)
- Change in the process measurement (e.g. the temperature sensor is broken)
However, if we detect data drift, then we can:
- Understand that the most likely reason for the poor model performance is the data drift
- Quickly decide what to do to avoid wasting the final product
Scenario 3: What if the model is correct but the humidity lab sample data is wrong?
It can also happen that the values we treat as ground truth are wrong. This can happen because of human error or because the sensors (or other equipment, e.g. cameras) have become biased.
In our particular case, the humidity sensor could easily have degraded. We would then observe a high error while the data monitoring system reports that it is not caused by input data drift (see the illustration below).
In this case, because no data drift is detected, we can conclude that the lab samples themselves might be wrong, for example because the lab equipment is broken.
Without data monitoring, this would be the last explanation to consider, because the target is usually assumed to be correct.
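The reasoning in Scenarios 2 and 3 can be summarized as a small decision helper that combines two signals: whether the model error is high and whether input data drift was detected. The function `likely_cause` and its return messages are hypothetical, and the result is only a heuristic hint about where to look first, not a proof of the root cause.

```python
def likely_cause(error_is_high, input_drift_detected):
    """Suggest where to look first, given the two monitoring signals."""
    if error_is_high and input_drift_detected:
        # Scenario 2: the inputs changed, so the model is outside its training range
        return "input data drift is the most likely cause"
    if error_is_high and not input_drift_detected:
        # Scenario 3: the inputs look stable, so the ground truth itself is suspect
        return "check the target measurements (lab samples, humidity sensor)"
    if input_drift_detected:
        # Error not (yet) high: drift is an early warning worth following up
        return "inputs drifted, watch the model closely"
    return "no action needed"

# Example: high error but stable inputs, as in Scenario 3
print(likely_cause(error_is_high=True, input_drift_detected=False))
```

A faulty process sensor (e.g. the temperature sensor) can also show up as drift in that input, so a drift alert is still worth verifying against the physical equipment.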
Summary
Data Drift is a shift in the statistical properties of the model input data (features) over time, which can affect model performance. Continuous data drift monitoring is an important part of ML systems because:
- It helps to estimate the model performance when true target values are rarely available
- It helps to understand the root cause of bad model performance
- It helps to spot cases where the ground-truth target values themselves are wrong
In the next blog articles, we will look closely at different ways to monitor and detect data drift.