When should you re-train?
Theoretically, a model will only degrade (become outdated and no longer useful) if the system you are modeling or the nature of the data has changed. Ideally you can spot this by setting up automated monitoring of the model in production. This could mean comparing predictions on new incoming data with the ground truth as it arrives, and being alerted when your error metric leaves your desired range. Or it could mean keeping tabs on an indirectly related KPI and, if it leaves your desired range, reevaluating whether the model is still serving its purpose. If your model is no longer useful, it is time to re-train, and the same best practices should be followed as when you created the original model, particularly with regard to model validation.
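As a rough illustration of the monitoring idea above, here is a minimal sketch that tracks a rolling mean absolute error and flags when it leaves the desired range. The class name `ErrorMonitor`, the `max_mae` threshold, and the window size are all hypothetical choices for this example; in practice you would pick the metric and range that suit your application.

```python
from collections import deque
from statistics import fmean


class ErrorMonitor:
    """Tracks a rolling mean absolute error (MAE) over recent predictions."""

    def __init__(self, max_mae, window=100):
        # max_mae is an assumed upper bound of the "desired range" for the error metric.
        self.max_mae = max_mae
        # Only the most recent `window` absolute errors are kept.
        self.errors = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store the absolute error of one prediction once its ground truth arrives."""
        self.errors.append(abs(predicted - actual))

    def rolling_mae(self):
        """Mean absolute error over the current window (0.0 if nothing recorded yet)."""
        return fmean(self.errors) if self.errors else 0.0

    def needs_review(self):
        """True once the window is full and the rolling MAE exceeds the threshold."""
        window_full = len(self.errors) == self.errors.maxlen
        return window_full and self.rolling_mae() > self.max_mae
```

A flag from `needs_review()` is only a prompt to look at the model again; it says nothing about why the error grew or whether re-training will help.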
There is no reason to re-train if the error metric / KPI stays within your desired range (if your model is serving its purpose). There is no benefit to incrementally updating a model when the nature of the data and the system being modeled have not changed, but there are downsides:
- increased cloud computation costs,
- loss of "unseen" data that could be used for model validation,
- unnecessary extra work for the data scientist who must validate these new models.
What are the best practices for re-training?
It is risky to set up this re-training in an automated fashion because automated model training cannot yet produce models which match the quality and reliability of human-validated models.
Proper model validation cannot be done in absentia. Ideally, it looks something like the following (depending on the type of model you are building):
- ensure the data still meets the assumptions of the algorithm (e.g. for linear regression, are the residuals approximately normally distributed? Are the errors independent and evenly scattered about zero? Etc.)
- separate train/validation/test sets (keep an eye on over-fitting)
- use of cross-validation and/or bootstrapped samples
- validation of key model metrics (e.g. error, accuracy, F-value, p-value, etc.)
- comparison of model scores (e.g. accuracy) with an ANOVA F-statistic to determine whether there is a statistically significant difference between models (bonus points if those scores are averaged CV scores for each model)
- approximation and evaluation of a confidence interval for model score (e.g. "the 95% CI for the accuracy of this model is within range [78.04%, 79.60%]")
- use of an ROC curve to compare models
- a cost/benefit analysis of the best models:
  - time to train model
  - time to query model
  - scrutability (can I easily explain to stakeholders how this model works?)
  - interpretability (can I easily see which factors are deemed important and actionable by this model?)
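To make one item from the checklist above concrete, here is a minimal sketch of approximating a confidence interval for a model's accuracy with a percentile bootstrap. It assumes you already have true labels and predictions for a held-out test set; the function name `bootstrap_accuracy_ci` and its parameters are made up for this example, and the percentile method is just one of several ways to build such an interval.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, confidence=0.95, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap on held-out data."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)

    # Resample the per-example hits/misses with replacement and score each resample.
    scores = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)
        scores.append(sum(sample) / n)
    scores.sort()

    # Take the empirical quantiles of the resampled accuracies as the interval bounds.
    alpha = 1.0 - confidence
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper
```

Calling `bootstrap_accuracy_ci(y_true, y_pred)` on each candidate model gives intervals you can compare; heavily overlapping intervals are a hint that an apparent improvement may not be meaningful.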
If you believe your model will benefit from incremental updates, that implies the underlying system or nature of the data is changing over time, and thus the model you validated yesterday is not necessarily valid today. You should not assume that simply optimizing the parameters on the old model (given new data) will produce a statistically significant improvement in your model, or a valid model for that matter. Such incremental updates should be done with proper model validation overseen by a data scientist. This is because validating a model is a complex task, and semi-qualitative challenges will likely arise. For example:
- it could be the case that some predictor variables are no longer relevant and should be removed
- maybe the nature of the data fundamentally changed after 2015, so the training set should be filtered accordingly
- perhaps some new data needs to be collected to better reflect the system being modeled
- a change reflected in the new data could introduce multicollinearity into the model, violating a model assumption and thereby invalidating it
- maybe the algorithm or general approach needs to be changed altogether
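Some of these checks can at least be assisted by code, even if the judgment call stays with a person. As one example, here is a minimal sketch of a rough screen for multicollinearity that flags pairs of predictors whose Pearson correlation is very high. The function names, the dict-of-columns input, and the 0.9 threshold are assumptions for illustration; pairwise correlation can miss collinearity involving three or more variables, so treat it as a first look only.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column; report 0.0 by convention here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every predictor pair with |r| >= threshold.

    `columns` maps a predictor name to its list of values (a hypothetical layout).
    """
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged
```

Running this on both the old and the new training data shows whether a strong correlation has appeared that was not there when the model was first validated.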
When can re-training be automated?
Of course, you can (with caution) apply machine learning with looser regard for the fundamental assumptions and concepts. Whether that is acceptable is a matter of cost/benefit in your application area.
For example, it could be beneficial to set up automated re-training when you have tens of thousands of models that need to be updated on a regular basis and the stakes on the predictions are low. You might have a model for each individual user of an application, where you just want to predict some semi-trivial behavior. Perhaps the cost of an inaccurate prediction is low, but the benefit of an accurate prediction is high. This could make good business sense.
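If you do automate re-training in a low-stakes setting like this, one possible safeguard is to promote a freshly trained model only when it does at least as well as the current one on held-out data. The sketch below assumes each model is a plain callable from features to a numeric prediction and that `train_fn` is whatever training routine you already have; both names, and the use of mean absolute error, are hypothetical. It is a cheap sanity gate, not a replacement for the validation described earlier.

```python
from statistics import fmean


def mean_abs_error(model, rows):
    """Mean absolute error of `model` on rows of (features, target) pairs."""
    return fmean(abs(model(x) - y) for x, y in rows)


def retrain_and_maybe_promote(current_model, train_fn, train_rows, holdout_rows, tolerance=0.0):
    """Retrain automatically, but keep the current model unless the candidate holds up.

    Returns (model_to_serve, its_holdout_error). `tolerance` is an assumed slack:
    the candidate is promoted if its error is within `tolerance` of the current error.
    """
    candidate = train_fn(train_rows)
    current_err = mean_abs_error(current_model, holdout_rows)
    candidate_err = mean_abs_error(candidate, holdout_rows)
    if candidate_err <= current_err + tolerance:
        return candidate, candidate_err
    return current_model, current_err
```

With one model per user, this would run in a loop over users, and only the models that fail the gate repeatedly would need a human to look at them.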
But if the stakes are high on your predicted outcome, I would apply the fundamental concepts and follow the best practices to the letter.