For the past few years, the data community has focused on one of the central tasks in data science: building great ML models. As the industry evolves, new challenges arise, and attention is shifting toward putting ML models into production. Companies now want to use their models in actual business applications, extracting business value through data workflows. But that shift brings new problems to solve, model drift among them.
When a machine learning model is deployed to production, the main concern for data scientists is whether it stays relevant over time. Is the model still capturing the patterns in new incoming data, and is it still performing as well as it did during its design phase?
Let's go over a simple example. Say a company that owns several fast-food restaurants builds a model to predict inventory: french fries, lettuce, and so on. The team is happy with the model's metrics and decides to put it into production. It works well for a few days, but then COVID-19 hits unexpectedly, and the distribution of incoming data changes compared to the training data. This is called data drift.
Monitoring model drift is a crucial step in production ML; in practice, however, it proves challenging for many reasons, one of which is the delay in retrieving labels for new data. Without ground-truth labels, drift-detection techniques based on the model's accuracy are off the table.
There is plenty of documentation on how to reduce or eliminate drift, but few real tools or step-by-step examples that address it. The reason is that solving the problem requires building end-to-end workflows, and building them at scale is one of the biggest challenges in data science. I want to emphasize that putting ML into production is not just exposing a model behind an API endpoint. Putting ML into production means an end-to-end workflow that runs from ingesting data all the way to sending the output to a specific business application.
Let's start with the main reasons why an end-to-end workflow can reduce drift:
- Periodically Re-Fit
A good first-level intervention against drift is to periodically update your model with more recent historical data.
An end-to-end pipeline solves this by ingesting new data and feeding it to the algorithm. For example, you might update the model each week or each month with the data collected from the source.
You can do this by setting a condition: if the model is drifting, trigger a re-train.
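As a minimal sketch of such a trigger (the names, thresholds, and schedule here are illustrative, not any particular platform's API), a retrain condition can combine a time-based schedule with a drift score:

```python
import datetime

# Illustrative thresholds: retrain monthly, or sooner if drift is detected.
RETRAIN_EVERY = datetime.timedelta(days=30)
DRIFT_THRESHOLD = 0.2

def should_retrain(last_trained, drift_score, now):
    """Retrain when the model is older than the schedule allows, or when
    a drift score (e.g. a distribution-test statistic) crosses a threshold."""
    too_old = now - last_trained > RETRAIN_EVERY
    drifting = drift_score > DRIFT_THRESHOLD
    return too_old or drifting

last = datetime.datetime(2024, 1, 1)
check_time = datetime.datetime(2024, 1, 15)
print(should_retrain(last, drift_score=0.05, now=check_time))  # False: fresh and stable
print(should_retrain(last, drift_score=0.35, now=check_time))  # True: drifting
```

In a real pipeline this check would run on every scheduled execution, and a `True` result would kick off the retraining branch of the workflow.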
- Weight Data
Some algorithms allow you to weight the importance of input data.
In this case, if you have a pipeline, you can set alerts that tell you when to apply a weighting inversely proportional to the age of the data, so that more attention is paid to the most recent records (higher weight) and less to the oldest ones (lower weight).
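For example, a simple exponential-decay scheme (a hypothetical helper, not tied to any particular library) halves a sample's weight every `half_life` days:

```python
def recency_weights(ages_in_days, half_life=30.0):
    """Exponential-decay weights: a sample loses half its weight every
    `half_life` days, so recent rows dominate the fit."""
    return [0.5 ** (age / half_life) for age in ages_in_days]

# Rows aged 0, 30, 60, and 90 days get weights 1.0, 0.5, 0.25, 0.125.
print(recency_weights([0, 30, 60, 90]))
```

Many libraries accept such weights directly, for example through the `sample_weight` argument of scikit-learn estimators' `fit` methods.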
- Detect and Choose Model
In a data workflow you can run A/B tests across different models: when drift is detected, a re-train is triggered for all of them, and the results are compared to see whether a different algorithm now performs better. If so, that algorithm replaces the one previously in production.
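A minimal sketch of that comparison step, assuming each candidate model already has its evaluation metrics computed (the model names and scores below are made up for illustration):

```python
def pick_best_model(candidates, metric="accuracy"):
    """Given {model_name: metrics_dict}, return the name of the model
    with the best score. Assumes higher is better for the chosen metric."""
    return max(candidates, key=lambda name: candidates[name][metric])

scores = {
    "gbm_v1":      {"accuracy": 0.81},
    "gbm_retrain": {"accuracy": 0.86},
    "logreg":      {"accuracy": 0.78},
}
print(pick_best_model(scores))  # gbm_retrain
```

Because this selection runs after every retrain, the champion model can change over time without manual intervention.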
In Datagran, we specialize in making it extremely easy for data professionals to ingest data, run their models, and send the output to business applications via end-to-end workflows. Within our workflows we also make it extremely easy for data practitioners to reduce drift by retraining periodically when given rules are triggered (for example, if the accuracy falls below a threshold). Here's a step-by-step guide on how you can create your ML workflow and set conditions to re-train your model when drift is detected:
1. Connect your data via Datagran's integrations.
2. Upload your model(s). For documentation on how to upload your model, go here.
3. Set the condition to re-train your model when it is met.
4. Connect the output of your algorithms to an SQL Operator.
5. Set the condition in your SQL to detect the algorithm with the best metric. Note that this can change every time the model is re-trained, based on the condition from step 3.
6. Connect Slack and trigger a message announcing the new best algorithm.
7. Launch the new algorithm into production by connecting your favorite app; in this example it is simply a REST API.
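To illustrate the Slack step outside of any particular platform: Slack incoming webhooks accept a JSON payload with a `text` field, so an alert about the new best model can be as small as this (the webhook URL, model name, and message format are placeholders):

```python
import json
from urllib import request

def slack_alert(best_model, metric_value, webhook_url=None):
    """Build (and optionally post) a Slack message announcing the new
    champion model. Posting only happens when a webhook URL is given."""
    payload = {"text": f"New best model: {best_model} (accuracy={metric_value:.3f})"}
    if webhook_url:
        req = request.Request(
            webhook_url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        request.urlopen(req)  # fires the alert
    return payload

print(slack_alert("gbm_retrained", 0.861)["text"])
```

In a workflow tool this POST is handled by the Slack connector itself; the sketch only shows what travels over the wire.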
Reducing drift is currently a pain point because disparate tools make it extremely difficult to automate an end-to-end workflow. Uploading a model and exposing it via a REST API is not an end-to-end solution. With today's technology, reducing drift should be one less challenge in your data execution, and real end-to-end solutions can provide the answer.