## Should a model be re-trained if new observations are available?


So, I have not been able to find any literature on this subject, but it seems worth some thought:

• What are the best practices in model training and optimization if new observations are available?

• Is there any way to determine the period/frequency of re-training a model before the predictions begin to degrade?

• Is it over-fitting if the parameters are re-optimised for the aggregated data?

Note that the learning may not necessarily be online. One may wish to upgrade an existing model after observing significant variance in more recent predictions.

1The answer is highly dependent on the business domain and particular model application. – Pete – 2019-10-22T18:59:20.587


1. Once a model is trained and new data becomes available, you can load the previous model and continue training it. For example, you can save your model as a .pickle file, load it when new data arrives, and train it further. Note that for the model to keep predicting correctly, the new training data should have a distribution similar to that of the past data.
2. How quickly predictions degrade depends on the dataset you are using. For example, suppose you train on Twitter data collected about a product that is widely tweeted about that day. If you then use tweets from a few days later, when the product is no longer being discussed, the model may be biased. The retraining frequency depends on the dataset, and there is no universal rule. If you observe that your new incoming data deviates substantially from the old data, it is good practice to retrain the model.
3. Optimizing parameters on the aggregated data is not overfitting; more data does not imply overfitting. Use cross-validation to check for overfitting.
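As a sketch of point 1, a scikit-learn estimator that supports `partial_fit` (here `SGDClassifier`; the synthetic data and the file path are illustrative) can be pickled and then trained further when new observations arrive:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Initial training on the data available so far.
X_old = rng.standard_normal((100, 4))
y_old = (X_old[:, 0] > 0).astype(int)

model = SGDClassifier(random_state=0)
model.partial_fit(X_old, y_old, classes=np.array([0, 1]))

# Persist the fitted model, e.g. as a .pickle file.
path = os.path.join(tempfile.mkdtemp(), "model.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, when new observations arrive: reload and continue training.
with open(path, "rb") as f:
    model = pickle.load(f)

X_new = rng.standard_normal((20, 4))
y_new = (X_new[:, 0] > 0).astype(int)
model.partial_fit(X_new, y_new)  # updates the existing weights in place
```

As noted above, this only works well if the new data comes from a distribution similar to the old one.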

So if the nature of the incoming data-set remains consistent throughout, there is nothing new that the model can learn? – 119631 – 2016-07-13T12:49:30.023

If the data doesn't change and you are happy with the accuracy of the current model, I see no point in retraining it. – Hima Varsha – 2016-07-13T12:52:52.323

@Aayush, Maybe you can use the incoming data as validation set and check your current model. – Hima Varsha – 2016-07-13T12:54:20.283

Still too early to accept, but I will. Thanks! – 119631 – 2016-07-13T12:57:54.553

Hello @tktktk0711, I do not currently have code to show you, but this issue points to another link with the code: https://github.com/tflearn/tflearn/issues/39

– Hima Varsha – 2017-08-18T14:41:37.693

Hi @Hima Varsha, thanks for your kind answer. I want to ask about when to retrain a model. Are there methods that retrain a model automatically, or automatically notify users that they should retrain (for a supervised learning model such as random forest or GBDT)? I do not mean online learning. I found that most methods retrain the model periodically. Do you know of any literature on this issue (automatic retraining)? – tktktk0711 – 2017-08-21T07:48:11.313

@HimaVarsha: I am using a LinearSVC model, which does not have a warm_start parameter. Is it still possible to fit the model again on new data while keeping the fit on the old data? I am saving the model as a pickle. – Sumit S Chawla – 2018-06-06T08:44:09.097

@HimaVarsha Regarding point 2, how do you determine whether or not the new incoming data deviates vastly from the old? Do you do any statistical test? I thought to perform a non-parametric statistical test (like Wilcoxon-Mann-Whitney, for example) to compare each feature between the old and the new data, but wasn't sure that it is the right way to go... Thanks! – Inna – 2020-08-02T14:27:57.510


When new observations are available, there are three ways to retrain your model:

1. Online: each time a new observation is available, you use this single data point to further train your model (e.g. load your current model and train it further by doing backpropagation on that single observation). With this method, your model learns sequentially and sort of adapts locally to your data, in that it will be more influenced by recent observations than by older ones. This can be useful when your model needs to adapt dynamically to new patterns in the data. It is also useful when you are dealing with extremely large data sets for which training on everything at once is impossible.
2. Offline: you add the new observations to your already existing data set and entirely retrain your model on this new, bigger data set. This generally leads to a better global approximation of the target function and is very popular if you have a fixed data set, or if you don't get new observations too often. However, it is impractical for large data sets.
3. Batch/mini-batch: this is a middle-ground approach. With batch, you wait until you have a batch of $n$ new observations and then train your already existing model on this whole batch. It is not offline, as you are not adding the batch to your pre-existing data set and retraining on everything, and it is not online, as you are training your model on $n$ observations at once rather than just one. So it's a bit of both :) Mini-batch is exactly the same except that the batch size is smaller, so it tends towards online learning. In fact, online learning is just batch learning with batch size 1, and offline learning is batch learning with a batch size equal to the whole data set.

Most models today use batch/mini-batch training, and the choice of batch size depends on your application and model. Choosing the right batch size is equivalent to choosing the right frequency with which to re-train your model. If your new observations have low variance relative to your existing data, I'd suggest larger batches (maybe 256-512); if, on the contrary, new observations tend to vary greatly from your existing data, use small batches (8-256). At the end of the day, the batch size is like another hyper-parameter which you need to tune and which is specific to your data.
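The batch approach can be sketched as a buffer that triggers an incremental update every `BATCH_SIZE` observations (a toy simulated stream; the names and sizes are illustrative):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
# A hypothetical stream of (x, y) observations, here simulated.
stream = [(rng.standard_normal(3), float(rng.standard_normal()))
          for _ in range(100)]

BATCH_SIZE = 16  # the re-training frequency discussed above
model = SGDRegressor(random_state=0)

buffer_X, buffer_y = [], []
for x, y in stream:
    buffer_X.append(x)
    buffer_y.append(y)
    if len(buffer_X) == BATCH_SIZE:
        # One incremental update on the accumulated batch.
        model.partial_fit(np.array(buffer_X), np.array(buffer_y))
        buffer_X, buffer_y = [], []
```

With `BATCH_SIZE = 1` this degenerates to online learning; buffering the entire data set instead would correspond to offline retraining.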

Hi, I want to ask about the online method for new data: does it only apply to some ML models, rather than machine learning as a whole? – tktktk0711 – 2017-08-18T08:01:09.110

Do you know of any tensorflow examples that use batch? – maxisme – 2018-03-05T18:17:11.343


## When should you re-train?

Theoretically, a model will only degrade (become outdated and no longer useful) if the system you are modelling or the nature of the data has changed. Ideally you can spot this by setting up automated monitoring of the model in production. This could mean that predictions on new incoming data will be compared with the ground-truth data and you will be alerted if your error metric exceeds your desired range. Or it could mean you keep tabs on an indirectly related KPI, and if it exceeds your desired range, you must reevaluate whether the model is still serving your cause. If your model is no longer so useful, it is time to re-train, and the same best practices should be followed as when you created the original model, particularly with regards to model validation.
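The monitoring described above can be as simple as a threshold check on an error metric computed over recent predictions once ground truth arrives (a minimal sketch; the RMSE metric, the function name, and the threshold are illustrative choices):

```python
import numpy as np

def should_retrain(y_true, y_pred, max_rmse):
    """Alert when the production error metric leaves its desired range."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    return rmse > max_rmse

# Recent predictions vs. the ground truth that has since been observed.
print(should_retrain([10.0, 12.0, 11.0], [10.2, 11.8, 11.1], max_rmse=0.5))  # False
print(should_retrain([10.0, 12.0, 11.0], [13.0, 15.0, 14.0], max_rmse=0.5))  # True
```

In production, the same check could run on a schedule and page a data scientist rather than trigger retraining directly.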

There is no reason to re-train if the error metric / KPI stays within your desired range (if your model is serving its purpose). There is no benefit to incrementally updating a model when the nature of the data and the system being modeled have not changed, but there are downsides:

1. increased cloud computation costs,
2. loss of "unseen" data that could be used for model validation,
3. unnecessary extra work for the data scientist who must validate these new models.

## What are best-practices for re-training?

It is risky to set up this re-training in an automated fashion because automated model training cannot yet produce models which match the quality and reliability of human-validated models.

Proper model validation cannot be done in absentia. Ideally, it looks something like the following (depending on the type of model you are building):

• ensure the data still meets the assumptions of the algorithm (e.g. for linear regression, are the residuals normally distributed? Are the errors independently scattered about the mean? Etc.)
• train/validation/test set (keep an eye on over-fitting)
• use of cross-validation and/or bootstrapped samples
• validation of key model metrics (e.g. error, accuracy, F-value, p-value, etc.)
• comparison of model scores (e.g. accuracy) with an ANOVA F-statistic to determine whether there is a statistically significant difference between models (bonus points if those scores are averaged CV scores for each model)
• approximation and evaluation of a confidence interval for model score (e.g. "the 95% CI for the accuracy of this model is within range [78.04%, 79.60%]")
• use of an ROC curve to compare models
• a cost/benefit analysis of the best models:
  • time to train the model
  • time to query the model
  • scalability
  • scrutability (can I easily explain to stakeholders how this model works?)
  • interpretability (can I easily see which factors are deemed important and actionable by this model?)
  • updatability
  • etc.
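Two of the bullets above, averaged cross-validation scores compared with an ANOVA F-statistic, can be sketched like this on a toy dataset (the candidate models and the data are illustrative):

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the old + new aggregated observations.
X, y = make_classification(n_samples=300, random_state=0)

# Averaged CV accuracy for each candidate model.
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

# ANOVA F-test across the per-fold scores: is the difference significant?
f_stat, p_value = f_oneway(scores_a, scores_b)
print(f"mean acc A={scores_a.mean():.3f}, B={scores_b.mean():.3f}, "
      f"F={f_stat:.2f}, p={p_value:.3f}")
```

A small p-value suggests the score difference between the two models is unlikely to be explained by fold-to-fold noise alone.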

If you believe your model will benefit from incremental updates, that implies the underlying system or nature of the data is changing over time, and thus the model you validated yesterday is not necessarily valid today. You should not assume that simply optimizing the parameters on the old model (given new data) will produce a statistically significant improvement in your model, or a valid model for that matter. Such incremental updates should be done with proper model validation overseen by a data scientist. This is because validating a model is a complex task, and semi-qualitative challenges will likely arise. For example:

• it could be the case that some predictor variables are no longer relevant and should be removed
• maybe the nature of data fundamentally changed after 2015 so the training set should be filtered
• perhaps some new data needs to be collected to better reflect the system being modeled
• a change reflected in the new data could introduce multicollinearity into the model, violating a model assumption and thereby invalidating it
• maybe the algorithm or general approach needs to be changed altogether

## When can re-training be automated?

Of course, you can (with caution) apply machine learning with loose regard for its fundamental assumptions and concepts. Whether you should is a matter of cost/benefit in your application area.

For example, it could be beneficial to set up automated re-training when you have tens of thousands of models that need to be updated on a regular basis and the stakes of the predictions are low. For instance, you could have a model for each individual user of an application and just want to predict some semi-trivial behavior. Perhaps the cost of an inaccurate prediction is low, but the benefit of an accurate prediction is high. This could make good business sense.

But if the stakes are high on your predicted outcome, I would apply the fundamental concepts and follow the best-practices to the letter.


Your problem comes under the umbrella of online learning methods. Assuming a stream of incoming data, you can use stochastic gradient descent (SGD) to update your model parameters using a single example at a time.

If your cost function is:

$\min_\theta J(x, y, \theta),$

where $\theta$ is the parameter vector, then, assuming streaming data of the form $(x^{i}, y^{i})$, you can update your parameter vector using SGD with learning rate $\eta$ and the following update equation:

$\theta^{t} = \theta^{t-1} - \eta \, \nabla_\theta J(x^{i}, y^{i}, \theta^{t-1})$.

This is essentially SGD with batch size 1.

There is one other trick: you can adopt a window/buffer-based method, where you buffer some examples from the stream, treat them as a batch, and use batch SGD. In that case the update equation becomes:

$\theta^{t} = \theta^{t-1} - \eta \sum_{i} \nabla_\theta J(x^{i}, y^{i}, \theta^{t-1})$.

This is essentially mini-batch SGD.
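A direct NumPy translation of the two update rules, with an explicit learning rate `lr` and squared-error loss as an example objective (the loss choice and function names are illustrative):

```python
import numpy as np

def sgd_step(theta, grad_fn, x_i, y_i, lr=0.01):
    """Online update on a single streamed example."""
    return theta - lr * grad_fn(x_i, y_i, theta)

def minibatch_step(theta, grad_fn, X_batch, y_batch, lr=0.01):
    """Buffered update: subtract lr times the summed per-example gradients."""
    total = sum(grad_fn(x, y, theta) for x, y in zip(X_batch, y_batch))
    return theta - lr * total

# Example objective J = (theta . x - y)^2 / 2, with gradient (theta . x - y) x.
def grad(x, y, theta):
    return (theta @ x - y) * x

theta = np.zeros(2)
theta = sgd_step(theta, grad, np.array([1.0, 2.0]), 3.0)           # single example
theta = minibatch_step(theta, grad,
                       [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
                       [1.0, 2.0])                                 # buffered batch
```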


The question: SHOULD you retrain?

The answer depends on what your model attempts to do and in what environment it is applied.

Let me explain with a couple of examples:

Suppose that your model attempts to predict customers' behaviour, e.g. how likely a customer is to purchase your product given an offer tailored to them. Clearly, the market changes over time: customers' preferences change, and your competitors adjust. You should adjust as well, so you need to retrain periodically. In such a case I would recommend adding new data but also omitting old data that is no longer relevant. If the market is fast-changing, you should even consider retraining periodically on new data only.

On the other hand, if your model classifies medical imaging (e.g. X-ray or MRI) into medical conditions, and the model performs well, you do not need to retrain as long as there is no change in the technology or in the medical know-how. Adding more data will not improve it much.